Several bacula-fd clients are backing up to one bacula-sd (using 2G disk files on a large RAID array, not tape), usually 2-3 of them concurrently. It works fine, except when 2 bigger clients (about 400-900GB each) need to run full backups: these get terribly slow (usually about 200-2500 Kbyte/s), so a full backup does not finish within several days, which is a problem for us (so we cancel it and fall back to incrementals).
The server and clients are in different VLANs/subnets, routed through another Debian wheezy machine, with VLANs and a few switches in between. NICs are 1Gbps on all machines (one has active-passive network bonding; failing it over does not help with speed), as are the switches. The machines are quad- and 8-core with 8-64GB RAM, are not going into swap, and have load averages between 0.2-2, so I don't think it is CPU, I/O or memory starvation. bacula-sd is also on hardware RAID, which does not seem to be under load. The network is also mostly idle at the time, so it should not be bandwidth congestion. The Bacula version is the standard wheezy 5.2.6+dfsg-9, and the Linux kernel is also the standard wheezy 3.2.60-1+deb7u3.
The transfer starts just fine (at about 20+ Mbytes/sec, which is to be expected with lots of small files), then at some point Send-Q goes up and does not come down for dozens of seconds (or minutes), and netstat shows the socket in the "unkn-4" timer, restarting with exponentially rising timeouts:
# netstat -tpno | grep bacula
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 967688 10.66.3.135:49668 10.66.2.11:9103 ESTABLISHED 2964/bacula-fd unkn-4 (1.86/0/0)
tcp 0 0 10.66.3.135:9102 10.66.2.11:54499 ESTABLISHED 2964/bacula-fd keepalive (5882.64/0/0)
Then after some time, the packets seem to start flowing again:
# netstat -tpno | grep bacula
tcp 0 38054 10.66.3.135:49668 10.66.2.11:9103 ESTABLISHED 2964/bacula-fd on (0.21/0/0)
tcp 0 0 10.66.3.135:9102 10.66.2.11:54499 ESTABLISHED 2964/bacula-fd keepalive (385.49/0/0)
and the backup continues (confirmed with status client=blowgun-fd in bconsole). For example:
* status client=axe-fd
newaxe-fd Version: 5.2.6 (21 February 2012) x86_64-pc-linux-gnu debian 7.0
Daemon started 24-Oct-14 17:27. Jobs: run=0 running=0.
Heap: heap=683,600 smbytes=761,617 max_bytes=807,280 bufs=396 max_bufs=426
Sizeof: boffset_t=8 size_t=8 debug=200 trace=1
Running Jobs:
JobId 12640 Job axe.2014-10-24_23.05.01_40 is running.
Full Backup Job started: 24-Oct-14 23:05
Files=2,529,050 Bytes=253,018,715,824 Bytes/sec=1,479,901 Errors=6
Files Examined=2,529,050
Processing file: /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
SDReadSeqNo=5 fd=5
Director connected at: 26-Oct-14 21:34
bg.jpg is 1.2MB in size, and the backup was stuck on it for minutes...
Heartbeat interval is set to 120 in the director, SD and file daemon configurations, and seems to be working fine. Enabling a debug trace file with setdebug level=200 trace=1 all, I see:
newaxe-fd: backup.c:371-0 FT_REG saving: /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
newaxe-fd: backup.c:469-0 bfiled: sending /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg to stored
newaxe-fd: crypto.c:607-0 crypto_digest_new jcr=2f01748
newaxe-fd: backup.c:1467-0 No strip for /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
newaxe-fd: backup.c:609-0 type=3 do_read=1
newaxe-fd: bfile.c:963-0 open file /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
newaxe-fd: backup.c:1194-0 Send data to SD len=65135
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
strace seems to confirm that:
# strace -tt -ff -s999 -p 3907
Process 3907 attached with 4 threads - interrupt to quit
[pid 27650] 22:25:15.705796 write(5, "[....]"..., 55110 <unfinished ...>
[pid 27661] 22:25:15.706103 select(6, [5], NULL, NULL, {2, 804806} <unfinished ...>
[pid 3912] 22:25:15.706147 restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 3907] 22:25:15.706168 select(4, [3], NULL, NULL, NULL <unfinished ...>
[pid 3912] 22:25:16.619938 <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 3912] 22:25:16.620008 futex(0x397d82d0240, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3912] 22:25:16.620092 futex(0x397d82d0284, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 13229, {1414358746, 620076000}, ffffffff <unfinished ...>
[pid 27661] 22:25:18.513819 <... select resumed> ) = 0 (Timeout)
[pid 27661] 22:25:18.513858 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:18.513928 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout)
[pid 27661] 22:25:23.519025 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:23.519139 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout)
[pid 27661] 22:25:28.524240 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:28.524317 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout)
[pid 27661] 22:25:33.529409 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:33.529508 select(6, [5], NULL, NULL, {5, 0}^C <unfinished ...>
fd 5 is the network connection to bacula-sd, and the process blocks when writing to it. Researching "unkn-4" indicates it actually means "zero window probe timer is pending".
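A zero window is the receiver's kernel signalling that its socket receive buffer is full because the application is not reading fast enough. The stall can be reproduced with a minimal sketch (a hypothetical localhost setup for illustration, not Bacula code): the receiver accepts a connection but never reads, so the sender's non-blocking sends eventually fail once the send buffer and the peer's receive window fill up, the same way bacula-fd ends up blocking in write():

```python
import socket

# Minimal sketch: reproduce receiver-side backpressure (a TCP zero window)
# on localhost. The receiver sets a tiny receive buffer and never calls
# recv(), so the sender can only queue a bounded amount of data before the
# kernel refuses more -- analogous to bacula-fd stalling while bacula-sd
# is not consuming data.
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # tiny window
srv.bind(("127.0.0.1", 0))
srv.listen(1)

snd = socket.socket()
snd.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)  # small send buffer
snd.connect(srv.getsockname())
conn, _ = srv.accept()          # accepted, but we never recv() on conn

snd.setblocking(False)
sent, chunk = 0, b"x" * 65536
try:
    while True:
        sent += snd.send(chunk)  # fills send buffer + peer's receive window
except BlockingIOError:
    pass                         # kernel won't take more: window is closed
print("bytes queued before the window closed:", sent)
```

At this point netstat on the sender would show a large Send-Q and the zero-window probe timer, just like in the output above.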
So it looks to me like either bacula-sd is doing the throttling for some reason (a bug?) or (more likely, IMHO) it is some sort of networking issue.
There don't seem to be any errors on the active ethernet adapters. I've tried using ethtool to disable offloading and other features; it didn't help. ping -f does not lose packets even while the TCP problem is manifesting itself:
axe# ping -s1400 -f -c1000 10.66.2.11
--- slingshot.tomsoft.lan ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 607ms
rtt min/avg/max/mdev = 0.391/0.582/0.672/0.039 ms, ipg/ewma 0.608/0.585 ms
I'm looking for ideas on how to troubleshoot (and, of course, finally fix) this.
UPDATE1: even after tuning up the TCP buffers, the situation is no better: the queue just gets bigger, but it still blocks and the backup is still slow. After looking some more at wireshark dumps, it seems to be a bacula-sd software problem, and TCP ZeroWindow is the kernel's normal way of throttling TCP in that case. So the machine receives data OK, but bacula-sd does not seem to be able to process it fast enough, even though the machine is not under load; this is on the bacula-sd side:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 3952353 0 10.66.2.11:9103 10.66.3.132:45226 ESTABLISHED 15839/bacula-sd keepalive (4863.09/0/0)
# uptime
05:23:13 up 2 days, 14:40, 2 users, load average: 0.42, 0.32, 0.27
It was SQL.
By default, each time bacula-fd sends a new file, bacula-sd tries to (via bacula-dir) insert the file's attributes into an SQL batch table. If you have lots of small files and your SQL server is not blazingly fast, each file inserts a small delay. Many small delays = throttled speed = backups cancelled due to Max Run Sched Time exceeded. And due to the architecture, even if you had multiple backups running, all of them would slow down.
The solution (kind of) was to add:
Spool Attributes = yes
in the JobDefs {...} section of bacula-dir.conf (note I'm using Spool Data = no because it is disk storage, not tape storage, so data spooling would just add overhead). With Spool Attributes = yes, bacula writes file attributes to a spool file, and only when the job is finished is that file sent to the SQL server. Note that in that case the connection to bacula-fd is freed (and the disk/network load on the client ends) as soon as the data transfer is done, so it does not wait for the SQL insertion to finish.
Some notes:
bacula-dir does not use SQL batch loading (e.g. LOAD DATA LOCAL INFILE), but sends those 3 million inserts one by one, waiting for confirmation of each one (the attribute data is stored in a compact binary file about 1GB big; translated to ASCII SQL INSERT statements it will surely be double that). So the increased latency due to MySQL being on another machine seems to totally kill the performance.
Edit1: upgrading bacula to 5.2.13 (as a manually created package), which does contain support for batch inserts (unlike the 5.2.6 packaged in Debian wheezy/jessie, which doesn't), and tuning the MySQL database so temporary tables are created in tmpfs, reduced the mentioned attribute despooling time from 56.5 hours down to 30 minutes. I guess replacing MySQL with PostgreSQL on the local machine (the same one running bacula-sd & bacula-dir) might improve even on that, but this is good enough for us.
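The cost of per-statement round-trips can be sketched with a toy benchmark (sqlite3 standing in for MySQL, and an artificial 1 ms sleep standing in for the network round-trip to a remote SQL server; both are assumptions for illustration, not Bacula's actual code path):

```python
import sqlite3
import time

# Toy benchmark: per-row inserts, each paying a simulated network round-trip,
# vs one batched insert paying the round-trip once. sqlite3 stands in for
# MySQL and the sleep for remote-server latency (both are assumptions).
LATENCY = 0.001  # simulated 1 ms round-trip per statement
rows = [("file%06d" % i, i) for i in range(2000)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE batch (name TEXT, size INTEGER)")

t0 = time.time()
for name, size in rows:                       # one INSERT per file...
    db.execute("INSERT INTO batch VALUES (?, ?)", (name, size))
    time.sleep(LATENCY)                       # ...waiting for confirmation
per_row = time.time() - t0

t0 = time.time()
db.executemany("INSERT INTO batch VALUES (?, ?)", rows)  # one batch
time.sleep(LATENCY)                           # single round-trip
batch = time.time() - t0

print("per-row: %.2fs  batch: %.3fs" % (per_row, batch))
```

Even with only 2,000 rows and 1 ms of latency, the per-row path takes seconds while the batch takes milliseconds; scale that to 3 million attribute rows and a real WAN/LAN round-trip and the 56-hour despooling time becomes plausible.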