What are the best techniques to improve rsync-over-ssh mirroring between Unix boxes, assuming that one system will always have the master copy and the other system will always have a recent copy (less than 48 hours old)?
Also, what would one have to do to scale that approach to handle dozens of machines getting a push of those changes?
If :
You can use find -ctime or find -cnewer to make a list of the files changed since the last execution, and copy over only the modified files (just a glorified differential push). This translates quite nicely to multiple hosts: just do a differential tar on the source, and untar it on all the hosts.
It gives you something like this:
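A minimal sketch of such a differential push, assuming a find that supports -cnewer and a tar that supports --null -T (the GNU and BSD versions do) and working ssh keys. The function names, the stamp file, and the paths and hosts in the usage comment are all made up for illustration:

```sh
#!/bin/sh
# Sketch of a differential tar push. All names and paths are hypothetical.

# make_diff_tar SRC STAMP OUT   (STAMP and OUT should be absolute paths)
# Archive every file under SRC whose ctime is newer than STAMP's mtime.
# If STAMP does not exist yet (first run), archive everything.
make_diff_tar() (
    src=$1 stamp=$2 out=$3
    # Record the new reference time before scanning, so files changed
    # during the scan are picked up again by the next run.
    touch "$stamp.new" || exit 1
    if [ -e "$stamp" ]; then
        (cd "$src" && find . -type f -cnewer "$stamp" -print0)
    else
        (cd "$src" && find . -type f -print0)
    fi | tar -C "$src" --null -T - -czf "$out" || exit 1
    mv "$stamp.new" "$stamp"
)

# push_diff_tar TARBALL DEST HOST...
# Unpack the archive under DEST on each host and report hosts that failed.
push_diff_tar() (
    tarball=$1 dest=$2
    shift 2
    for host in "$@"; do
        ssh "$host" "tar -C '$dest' -xzf -" < "$tarball" ||
            echo "push to $host failed" >&2
    done
)

# Example (hypothetical paths and hosts):
#   make_diff_tar /master/data /var/tmp/push.stamp /var/tmp/diff.tgz &&
#   push_diff_tar /var/tmp/diff.tgz /mirror/data mirror1 mirror2
```

Note that a tar of changed files does not propagate deletions or renames, so an occasional full rsync is still needed to clean up the mirrors.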
The script has to be refined, but you get the idea.
Presuming that the data you're rsyncing isn't already compressed, turning on compression (-z) will likely help transfer speed, at the cost of some CPU on either end.
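For illustration, a compressed push over ssh could look like the sketch below. The function name, the DRY_RUN switch, and the host and paths in the example are assumptions, not something from the question:

```sh
#!/bin/sh
# Sketch: compressed push over ssh. Host and paths are hypothetical.
# Set DRY_RUN=1 to print the command instead of running it.

# push_compressed SRC TARGET
# -a preserves permissions, times and so on; -z compresses data in transit.
push_compressed() (
    src=$1 target=$2
    set -- rsync -az -e ssh "$src" "$target"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
)

# Example: push_compressed /master/data/ mirror1:/mirror/data/
```

On a fast LAN with already-compressed data (archives, media), it is worth timing a run with and without -z before settling on it.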
rsync has a way of doing disconnected copies. In other words, rsync can (conceptually) diff a directory tree and produce a patch file, which you can later apply to any number of targets that are identical to the original destination (the tree the patch was generated against).
It requires that you invoke rsync with the master and mirror with --write-batch; it produces a file. You then transfer this file to any number of other targets, and apply the batch to each of those targets using --read-batch.
If you keep a local copy of the last rsynced state (i.e. a copy of what the mirrors look like right now) on the same machine as the master, you can generate this "patch" on the master without even contacting any mirror:
On the master:
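A minimal sketch of this batch workflow, wrapped in small functions so the pieces are easy to see. The paths /master/data and /current/mirror and the file name my-batch.rsync are the ones used in this answer; the function names, the mirror host names, and the remote path /mirror/data are assumptions:

```sh
#!/bin/sh
# Sketch of the rsync batch workflow. Function names, hosts and the
# remote destination path are hypothetical.

# make_batch SRC REF BATCH   (run on the master)
# Bring the local reference copy REF up to date with SRC and record the
# changes in BATCH at the same time.
make_batch() {
    rsync -a --write-batch="$3" "$1" "$2"
}

# apply_batch BATCH DEST     (run on a mirror that has a copy of BATCH)
apply_batch() {
    rsync -a --read-batch="$1" "$2"
}

# push_batch BATCH DEST HOST...   (run on the master)
# Stream the batch to each mirror over ssh instead of copying it first.
push_batch() (
    batch=$1 dest=$2
    shift 2
    for host in "$@"; do
        ssh "$host" rsync -a --read-batch=- "$dest" < "$batch" ||
            echo "batch failed on $host" >&2
    done
)

# Example:
#   make_batch /master/data/ /current/mirror/ my-batch.rsync
#   push_batch my-batch.rsync /mirror/data/ mirror1 mirror2
```

The batch only applies cleanly to a mirror that still matches the reference copy, so a mirror that missed a batch needs a normal rsync run to catch up.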
Add whatever other options you want. This will do two things:
1. Update /current/mirror to reflect /master/data
2. Write a batch file, my-batch.rsync, for later use.

Transfer the my-batch.rsync file from the master to all of your mirrors, and then on the mirrors, apply the patch, so to speak.

A benefit of this approach: --read-batch is only CPU/IO intensive on the mirror itself.

When you are rsyncing as a backup method, the biggest problem you will run into is having a very large number of files to back up. Rsync can handle large files without a problem, but if the number of files gets too large, rsync won't complete in a reasonable amount of time. If this happens, you will need to break the backup down into smaller parts and loop over those parts, or tar the fileset down to reduce the number of files.
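A minimal sketch of the "smaller parts" approach, assuming the fileset splits reasonably along its top-level subdirectories. The function name and the example paths are hypothetical:

```sh
#!/bin/sh
# Sketch: one rsync run per top-level subdirectory instead of one huge run.

# sync_in_parts SRC TARGET
# TARGET may be local or host:path. Files sitting directly in SRC and
# hidden directories are not covered by the glob below.
sync_in_parts() (
    src=${1%/} target=${2%/}
    status=0
    for dir in "$src"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # Trailing slashes: copy the contents of $dir into $target/$name/.
        rsync -a "$dir" "$target/$name/" || status=1
    done
    exit "$status"
)

# Example: sync_in_parts /master/data backup1:/backup/data
```

Each part builds its own, much smaller file list, and a failure in one part does not stop the others.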
As for having dozens of machines getting a mirror of those changes, it depends on how fresh the backup needs to be. One approach would be to mirror the changes from the primary server to the backup server and then have the other servers pull their changes from the backup server, for example through an rsync daemon running on it. To avoid overwhelming the backup server, either schedule the other servers to pull at slightly different times, or have a script use passwordless ssh to connect to each server in turn and tell it to pull a fresh copy. Whether you go to that much trouble depends on how many other machines are pulling a copy of the backup.
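The "script tells each server to pull" variant could be sketched like this. It assumes an rsync daemon on the backup server exporting a module, and passwordless ssh to every mirror; the host names, the module name, the paths, and the DRY_RUN switch are all hypothetical:

```sh
#!/bin/sh
# Sketch: trigger the mirrors one at a time so that only one of them is
# pulling from the backup server at any moment.
BACKUP_HOST=${BACKUP_HOST:-backup1}   # hypothetical daemon host

# Print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# trigger_pulls MODULE DEST HOST...
trigger_pulls() (
    module=$1 dest=$2
    shift 2
    for host in "$@"; do
        run ssh "$host" "rsync -az rsync://$BACKUP_HOST/$module/ $dest/" ||
            echo "pull failed on $host" >&2
    done
)

# Example: trigger_pulls data /mirror/data mirror1 mirror2 mirror3
```

The scheduled alternative needs no script at all: give each mirror a cron entry for the same rsync pull, offset by a few minutes from its neighbours.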
If you're transferring very large files with lots of changes, use the --inplace and --whole-file options. I use these for my 2 GB VM images and it helped a lot (mainly because the rsync protocol wasn't saving much by sending incremental data for these files). I don't recommend these options for most cases, though.
Use --stats to see how well your files are being transferred using the rsync incremental protocol.
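One way to read the --stats output is to look only at the "Literal data" and "Matched data" lines. A minimal sketch, with a made-up function name and example paths:

```sh
#!/bin/sh
# Sketch: show how much data was sent as-is ("Literal data") versus
# reused from what the destination already had ("Matched data").

# delta_stats SRC DEST [extra rsync options...]
delta_stats() (
    src=$1 dest=$2
    shift 2
    rsync -a --stats "$@" "$src" "$dest" |
        grep -E '^(Literal|Matched) data:'
)

# Example (hypothetical paths):
#   delta_stats /vm/images/ mirror1:/vm/images/
# If "Matched data" stays small next to "Literal data" for the big files:
#   rsync -a --inplace --whole-file /vm/images/ mirror1:/vm/images/
```

Keep in mind that rsync already defaults to --whole-file when both paths are local, so the numbers are only meaningful for a transfer to or from a remote host.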
Another strategy is to make ssh and rsync faster. If you are going over a trusted (read: private) network, then encrypting the actual payload is not necessary. You can use HPN-SSH, which can be set up to encrypt only the authentication. Also, rsync version 3 starts transferring files while it is still building the file list, which is a huge time saving over rsync version 2. I don't know if that's what you were looking for, but I hope it helps. Also, rsync reportedly supports multicasting in some way, though I will not pretend to understand how.
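To check whether a box has rsync 3, something like the following sketch works on the first line of rsync --version. The function names are made up, and the check has to pass on both ends for the newer behaviour to kick in:

```sh
#!/bin/sh
# Sketch: detect the rsync major version.

# rsync_major VERSION_LINE
# Extract the major version from a line such as
# "rsync  version 3.2.7  protocol version 31".
rsync_major() {
    printf '%s\n' "$1" |
        sed -n 's/^rsync  *version v*\([0-9][0-9]*\)\..*/\1/p'
}

# has_rsync3: succeed if the local rsync reports version 3 or later.
has_rsync3() {
    major=$(rsync_major "$(rsync --version | head -n 1)")
    [ -n "$major" ] && [ "$major" -ge 3 ]
}

# Example: has_rsync3 || echo "consider upgrading rsync" >&2
```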