We use rsync to update a mirror of our primary file server to an off-site colocated backup server. One of the issues we currently have is that our file server has > 1TB of mostly smaller files (in the 10-100kb range), and when we're transferring this much data, we often end up with the connection being dropped several hours into the transfer. Rsync doesn't have a resume/retry feature that simply reconnects to the server to pick up where it left off -- you need to go through the file comparison process again, which ends up being very lengthy with the number of files we have.
The solution that's recommended to get around this is to split up your large rsync transfer into a series of smaller transfers. I've figured the best way to do this is by the first letter of the top-level directory names, which doesn't give us a perfectly even distribution, but is good enough.
I'd like to confirm whether my methodology for doing this is sane, or if there's a simpler way to accomplish the goal.
To do this, I iterate through A-Z, a-z, 0-9 to pick a one-character $prefix. Initially I was thinking of just running:
rsync -av --delete --delete-excluded --exclude "*.mp3" "src/$prefix*" dest/
(--exclude "*.mp3" is just an example, as we have a much longer exclude list for removing things like temporary files)
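In loop form, that first idea would look roughly like this bash sketch (src/ and dest/ are placeholders for the real paths, and I'm assuming src/ is local so the shell can expand the prefix glob):

shopt -s nullglob                        # unmatched globs expand to nothing
for prefix in {A..Z} {a..z} {0..9}; do
    matches=( src/"$prefix"* )           # top-level entries starting with $prefix
    [ ${#matches[@]} -eq 0 ] && continue # nothing starts with this character
    rsync -av --delete --delete-excluded --exclude "*.mp3" "${matches[@]}" dest/
done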
The problem with this is that any top-level directories in dest/ that are no longer present on src will not get picked up by --delete. To get around this, I'm instead trying the following:
rsync \
--filter "S /$prefix*" \
--filter "R /$prefix*" \
--filter 'H /*' \
--filter 'P /*' \
-av --delete --delete-excluded --exclude "*.mp3" src/ dest/
I'm using the show and hide filter rules over include and exclude, because otherwise the --delete-excluded will delete anything that doesn't match $prefix.
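Wrapped in the same per-prefix loop, it would look something like the sketch below; the retry is just an idea for coping with dropped connections, and the 60-second delay is arbitrary:

for prefix in {A..Z} {a..z} {0..9}; do
    # re-run this prefix until rsync exits cleanly, e.g. after a dropped connection
    until rsync \
        --filter "S /$prefix*" \
        --filter "R /$prefix*" \
        --filter 'H /*' \
        --filter 'P /*' \
        -av --delete --delete-excluded --exclude "*.mp3" src/ dest/
    do
        echo "prefix $prefix failed, retrying in 60s" >&2
        sleep 60
    done
done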
Is this the most effective way of splitting the rsync into smaller chunks? Is there a more effective tool, or a flag that I've missed, that might make this simpler?
My solution to this was a different two-pass approach, where I trade off some disk space. I do rsync --only-write-batch on the server, then rsync the batch file itself to the destination, looping until the rsync succeeds. Once the batch file has made it over, rsync --read-batch on the destination recreates all the changes.
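Roughly, the sequence looks like this on the sending side; hostnames, paths, and the retry delay are placeholders rather than my actual setup:

BATCH=/var/tmp/mirror-batch

# 1) Compute the changes and record them in a local batch file, without
#    touching the destination (batch-writing rsync also creates ${BATCH}.sh).
rsync --only-write-batch="$BATCH" -a --delete src/ backup:/mirror/

# 2) Ship the batch file itself, retrying until it arrives intact.
until rsync --partial -av "$BATCH" backup:/var/tmp/; do
    sleep 60
done

# 3) On the backup server (whenever convenient), apply the batch:
#    rsync --read-batch=/var/tmp/mirror-batch -a /mirror/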
There are some unintended benefits to this for me as well:
because I'm more concerned that the backup "exists" than that it's "usable", I don't actually do the read-batch on the receiving end every day -- most of the time the batch is relatively small
I've been experimenting with --checksum-seed=1 ... I might be mis-reading the documentation, but I think it makes the batch files more syncable (i.e. when I don't do the --read-batch on any given day, the next day's batch syncs faster because the previous day's batch is a good basis) -- see the sketch after this list
if the batch gets too big to send "in time" over the internet, I can sneaker-net it over on an external drive. By "in time" I mean getting the batch over and read before the next day's backup starts.
although I don't personally do this, I could have two offsite backups in separate locations and send the batch to both of them.
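For what it's worth, the checksum-seed part looks roughly like this (the seed value of 1 is what I've been experimenting with; paths and hostnames are placeholders):

# write each day's batch with a fixed checksum seed, so consecutive batch
# files have more content in common
rsync --only-write-batch=/var/tmp/mirror-batch --checksum-seed=1 -a --delete src/ backup:/mirror/

# sync the new batch on top of yesterday's copy at the destination, so only
# the differences between the two batch files cross the wire
rsync --partial -av /var/tmp/mirror-batch backup:/var/tmp/mirror-batch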
Not exactly answering your question, but another option I use pretty often is a two-pass approach: first build a list of files, then split the list of files to be transferred and feed the file list into rsync/cpio/cp etc.
rsync --dry-run --itemize-changes <rest of options>
will print out a list of files to be transferred along with a bunch of useful metadata; from that output it's fairly easy to extract the filenames and then do the actual copy with either rsync --files-from or another tool. Could be useful for your situation - resuming from a broken transfer would be much quicker.
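A sketch of how that could look, with placeholder paths and a deliberately simplified extraction step (it ignores deletions, symlink "-> target" suffixes, and filenames containing newlines):

# pass 1: dry-run push and keep just the names of regular files that would
# be sent (the ">f" lines in the itemized output)
rsync -a --dry-run --itemize-changes src/ backup:/mirror/ \
    | grep '^>f' \
    | sed 's/^[^ ]* //' > /tmp/files-to-send.txt

# pass 2: split the list into chunks and transfer each chunk on its own, so
# a dropped connection only costs one chunk's worth of re-comparison
split -l 10000 /tmp/files-to-send.txt /tmp/files-chunk.
for chunk in /tmp/files-chunk.*; do
    rsync -a --files-from="$chunk" src/ backup:/mirror/
done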
I would suggest you look into the connection problem itself, instead of trying to solve it by creating another "problem".
That's not common behavior. Are you using rsync through SSH or rsyncd?
As far as I know, most "closed" connections occur when there is no data being transferred between the endpoints.
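If it is SSH, client-side keepalives are one thing worth trying, so a quiet connection isn't dropped by an idle timeout somewhere along the path (a sketch; the interval and count are arbitrary values, and the paths are placeholders):

# send an SSH keepalive every 60 seconds and give up only after 5 missed
# replies, so the tunnel isn't torn down during long quiet stretches
rsync -av -e "ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=5" src/ backup:/mirror/

If it's rsyncd, the daemon's timeout setting in rsyncd.conf (and any NAT or firewall idle timeouts in between) would be the first things I'd check.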