This looks very strange. I have been using Bacula, and now BareOS, for more than 10 years, but one system has started exhibiting strange behaviour and I can't figure out why, or how to fix it.
The daily backups run fine until they reach the BackupCatalog job, which is configured to run after everything else.
This job looks like it terminated successfully (JobStatus=T in the list jobs table):
*list jobs
...
+-------+---------------+--------------+---------------------+------+-------+----------+-----------------+-----------+
| JobId | Name | Client | StartTime | Type | Level | JobFiles | JobBytes | JobStatus |
+-------+---------------+--------------+---------------------+------+-------+----------+-----------------+-----------+
...
| 5,475 | BackupCatalog | kantor-fd | 2019-12-04 02:56:40 | B | F | 21 | 27,364,860 | T |
+-------+---------------+--------------+---------------------+------+-------+----------+-----------------+-----------+
However, neither in messages nor in the log file do I see the usual summary for this last job. The log file ends like this:
19-Nov 02:32 kantor-dir JobId 5398: shell command: run BeforeJob "/usr/lib/bareos/scripts/make_catalog_backup.pl Kantor"
19-Nov 02:33 kantor-dir JobId 5398: Start Backup JobId 5398, Job=BackupCatalog.2019-11-18_23.10.00_10
19-Nov 02:33 kantor-dir JobId 5398: Using Device "FileStorage" to write.
19-Nov 02:33 kantor-sd JobId 5398: Volume "Kantor-2018-01-08_08:48:50" previously written, moving to end of data.
19-Nov 02:33 kantor-sd JobId 5398: Ready to append to end of Volume "Kantor-2018-01-08_08:48:50" size=4716094462
19-Nov 02:33 kantor-sd JobId 5398: Elapsed time=00:00:05, Transfer rate=5.663 M Bytes/second
And that's all. Note that the RunAfterJob script seems not to have been executed, yet if I run it by hand, it works (the exported catalog database file gets removed). And this is not the only job with a RunAfterJob script.
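For reference, this is roughly how the catalog dump and cleanup are wired in the shipped sample configuration (a sketch from memory; script paths and the exact directive form may differ between versions):

```
Job {
  Name = "BackupCatalog"
  JobDefs = "DefaultJob"
  Level = Full
  FileSet = "Catalog"
  # Dump the catalog database before the backup starts
  RunBeforeJob = "/usr/lib/bareos/scripts/make_catalog_backup.pl Kantor"
  # Remove the dump after the backup finishes -- this is the script
  # that did not appear to run
  RunAfterJob = "/usr/lib/bareos/scripts/delete_catalog_backup"
}
```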
I expected the log to end with a summary like the following, which all other jobs have:
19-Nov 02:32 kantor-dir JobId 5397: Bareos kantor-dir 16.2.6 (02Jun17):
Build OS: x86_64-pc-linux-gnu debian Debian GNU/Linux buster/sid
JobId: 5397
Job: FTP.2019-11-18_23.05.00_09
...
FD termination status: OK
SD termination status: OK
Termination: Backup OK
19-Nov 02:32 kantor-dir JobId 5397: Begin pruning Jobs older than 1 month 10 days .
...
Also, the director status looks strange:
*status dir
kantor-dir Version: 16.2.6 (02 June 2017) x86_64-pc-linux-gnu debian Debian GNU/Linux buster/sid
Daemon started 03-Dec-19 11:10. Jobs: run=4, running=1 mode=0 db=mysql
Heap: heap=135,168 smbytes=222,459 max_bytes=236,758 bufs=543 max_bufs=594
Scheduled Jobs:
...
====
Running Jobs:
Console connected at 04-Dec-19 09:03
JobId Level Name Status
======================================================================
5475 Full BackupCatalog.2019-12-03_23.10.00_08 has terminated
====
Terminated Jobs:
JobId Level Files Bytes Status Finished Name
====================================================================
...
5471 Incr 6,591 7.499 G OK 03-Dec-19 23:15 termsrv
5472 Incr 427 11.37 G OK 03-Dec-19 23:44 1C
5473 Incr 3 3.198 G OK 04-Dec-19 02:56 Oracle
5474 Incr 5,797 2.600 G OK 04-Dec-19 02:56 FTP
Client Initiated Connections (waiting for jobs):
...
====
That is, the job is listed under "Running Jobs", yet its status says it has terminated. It is not listed under "Terminated Jobs", as if the director still had something left to finish.
It hung in this state for six hours. I also see some strangeness with the times: the job's StartTime in the table and in the log file differ by half an hour, although the system date and MySQL's select NOW(); are in sync.
After restarting the director, its status looks more reasonable:
Running Jobs:
Console connected at 04-Dec-19 09:06
No Jobs running.
====
No Terminated Jobs.
This all started two weeks ago. If I leave the job hanging, all subsequent scheduled jobs wait indefinitely for it, so no backups are performed at all.
My feeling is that this could be an issue with the job's RunAfterJob script, but it is the standard shipped script, and it works when I run it by hand. The job definition itself is also the standard shipped one; my only modification was adding compression=GZIP to the FileSet, which I do on every installation and which has never caused any problems.
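For completeness, that FileSet modification amounts to something like this (a sketch; the shipped Catalog FileSet may name the dump file and set other options differently):

```
FileSet {
  Name = "Catalog"
  Include {
    Options {
      signature = MD5
      compression = GZIP   # my only local change
    }
    # Path of the catalog dump produced by make_catalog_backup.pl;
    # may differ on other installations
    File = "/var/lib/bareos/bareos.sql"
  }
}
```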
What should I look for? How can I fix this?
Update:
The problem disappeared, and I don't understand why. Backups have been working for at least two days now, and nothing appears to be stuck.
It turns out the BackupCatalog job was configured to send the bootstrap file by e-mail at the end of the backup. If e-mail sending on the server is not working, the job gets stuck. If e-mail delivery was obstructed and then fixed afterwards outside the server, the job suddenly unsticks, with no visible indication of what changed. That seems to be my case.
By removing this Write Bootstrap directive, the problem is avoided entirely. (The job then writes the local bootstrap file as configured in the JobDefs-referenced DefaultJob template.) It is a deficiency in BareOS that it neither explains what could go wrong nor logs this as a problem and carries on; it just gets stuck. That sucks. It is also a pity that it is configured this way by default.
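For anyone hitting the same thing: in the sample configuration the BackupCatalog job pipes the bootstrap to bsmtp for mailing, roughly like this (a sketch; the bsmtp path and addresses are from memory of the shipped defaults and may differ on your system):

```
# The problematic default: the bootstrap is piped to bsmtp and mailed,
# so the job depends on working local mail delivery
Write Bootstrap = "|/usr/bin/bsmtp -h localhost -f \"\(Bareos\) \" -s \"Bootstrap for Job %j\" root@localhost"

# Writing it to a local file instead removes that dependency
Write Bootstrap = "/var/lib/bareos/%j.bsr"
```

Removing the piped variant (or replacing it with the file variant) falls back to the bootstrap handling defined in the DefaultJob template.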