This looks very strange. I have been using Bacula, and now BareOS, for more than 10 years, but one system has started exhibiting strange behaviour and I can't figure out why, or how to fix it.
The daily backups run fine until they reach the BackupCatalog job, which is configured to run after everything else.
This job looks like it terminated successfully (JobStatus=T in the list jobs table):
*list jobs
...
+-------+---------------+--------------+---------------------+------+-------+----------+-----------------+-----------+
| JobId | Name | Client | StartTime | Type | Level | JobFiles | JobBytes | JobStatus |
+-------+---------------+--------------+---------------------+------+-------+----------+-----------------+-----------+
...
| 5,475 | BackupCatalog | kantor-fd | 2019-12-04 02:56:40 | B | F | 21 | 27,364,860 | T |
+-------+---------------+--------------+---------------------+------+-------+----------+-----------------+-----------+
However, neither in messages nor in the log file do I see the usual summary for this last job. The log file ends like this:
19-Nov 02:32 kantor-dir JobId 5398: shell command: run BeforeJob "/usr/lib/bareos/scripts/make_catalog_backup.pl Kantor"
19-Nov 02:33 kantor-dir JobId 5398: Start Backup JobId 5398, Job=BackupCatalog.2019-11-18_23.10.00_10
19-Nov 02:33 kantor-dir JobId 5398: Using Device "FileStorage" to write.
19-Nov 02:33 kantor-sd JobId 5398: Volume "Kantor-2018-01-08_08:48:50" previously written, moving to end of data.
19-Nov 02:33 kantor-sd JobId 5398: Ready to append to end of Volume "Kantor-2018-01-08_08:48:50" size=4716094462
19-Nov 02:33 kantor-sd JobId 5398: Elapsed time=00:00:05, Transfer rate=5.663 M Bytes/second
And that's all. Note that the RunAfterJob script seems not to have been executed, yet if I run it by hand, it works (the exported catalog database file gets removed). And this is not the only job with a RunAfterJob script.
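For reference, this is roughly how the catalog dump and cleanup are wired in the shipped sample configuration (a sketch from memory; script paths and the exact directive form may differ between versions):

```
Job {
  Name = "BackupCatalog"
  JobDefs = "DefaultJob"
  Level = Full
  FileSet = "Catalog"
  # Dump the catalog database before the backup starts
  RunBeforeJob = "/usr/lib/bareos/scripts/make_catalog_backup.pl Kantor"
  # Remove the dump after the backup finishes -- this is the script
  # that did not appear to run
  RunAfterJob = "/usr/lib/bareos/scripts/delete_catalog_backup"
}
```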
I expected the log to end with a summary like the following, which all other jobs have:
19-Nov 02:32 kantor-dir JobId 5397: Bareos kantor-dir 16.2.6 (02Jun17):
Build OS: x86_64-pc-linux-gnu debian Debian GNU/Linux buster/sid
JobId: 5397
Job: FTP.2019-11-18_23.05.00_09
...
FD termination status: OK
SD termination status: OK
Termination: Backup OK
19-Nov 02:32 kantor-dir JobId 5397: Begin pruning Jobs older than 1 month 10 days .
...
Also, the director status looks strange:
*status dir
kantor-dir Version: 16.2.6 (02 June 2017) x86_64-pc-linux-gnu debian Debian GNU/Linux buster/sid
Daemon started 03-Dec-19 11:10. Jobs: run=4, running=1 mode=0 db=mysql
Heap: heap=135,168 smbytes=222,459 max_bytes=236,758 bufs=543 max_bufs=594
Scheduled Jobs:
...
====
Running Jobs:
Console connected at 04-Dec-19 09:03
JobId Level Name Status
======================================================================
5475 Full BackupCatalog.2019-12-03_23.10.00_08 has terminated
====
Terminated Jobs:
JobId Level Files Bytes Status Finished Name
====================================================================
...
5471 Incr 6,591 7.499 G OK 03-Dec-19 23:15 termsrv
5472 Incr 427 11.37 G OK 03-Dec-19 23:44 1C
5473 Incr 3 3.198 G OK 04-Dec-19 02:56 Oracle
5474 Incr 5,797 2.600 G OK 04-Dec-19 02:56 FTP
Client Initiated Connections (waiting for jobs):
...
====
That is, the job is listed under "Running Jobs", yet its status says it has terminated. It is not listed under "Terminated Jobs", as if the director still had something left to finish.
It hung in this state for six hours. I also see some strangeness with the times: the job's StartTime in the table and in the log file differ by half an hour, although the system date and MySQL's select NOW(); are in sync.
After restarting the director, its status looks more reasonable:
Running Jobs:
Console connected at 04-Dec-19 09:06
No Jobs running.
====
No Terminated Jobs.
This all started two weeks ago. If I leave the job hanging, all subsequent scheduled jobs wait indefinitely for it, so no backups are performed at all.
My feeling is that this could be an issue with the job's RunAfterJob script, but it is the standard shipped script, and it works when I run it by hand. The job definition itself is also the standard shipped one; my only modification was adding compression=GZIP to the FileSet, which I do on every installation and which has never caused any problems.
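For completeness, that FileSet modification amounts to something like this (a sketch; the shipped Catalog FileSet may name the dump file and set other options differently):

```
FileSet {
  Name = "Catalog"
  Include {
    Options {
      signature = MD5
      compression = GZIP   # my only local change
    }
    # Path of the catalog dump produced by make_catalog_backup.pl;
    # may differ on other installations
    File = "/var/lib/bareos/bareos.sql"
  }
}
```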
What should I look for? How can I fix this?
Update:
The problem disappeared, and I don't understand why. Backups have been working for at least two days now, and nothing appears to be stuck.
It turns out the BackupCatalog job was configured to send the bootstrap file by e-mail at the end of the backup. If e-mail sending on the server is not working, the job gets stuck. If e-mail delivery was obstructed and then fixed afterwards outside the server, the job suddenly unsticks, with no visible indication of what changed. That seems to be my case.
By removing this Write Bootstrap directive, the problem is avoided entirely. (The job then writes the local bootstrap file as configured in the JobDefs-referenced DefaultJob template.) It is a deficiency in BareOS that it neither explains what could go wrong nor logs this as a problem and carries on; it just gets stuck. That sucks. It is also a pity that it is configured this way by default.
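For anyone hitting the same thing: in the sample configuration the BackupCatalog job pipes the bootstrap to bsmtp for mailing, roughly like this (a sketch; the bsmtp path and addresses are from memory of the shipped defaults and may differ on your system):

```
# The problematic default: the bootstrap is piped to bsmtp and mailed,
# so the job depends on working local mail delivery
Write Bootstrap = "|/usr/bin/bsmtp -h localhost -f \"\(Bareos\) \" -s \"Bootstrap for Job %j\" root@localhost"

# Writing it to a local file instead removes that dependency
Write Bootstrap = "/var/lib/bareos/%j.bsr"
```

Removing the piped variant (or replacing it with the file variant) falls back to the bootstrap handling defined in the DefaultJob template.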