Question

sal

Asked: 2009-05-05 06:46:30 +0800 CST

improving rsync backup performance

772

What are the best techniques to improve rsync-over-ssh mirroring between Unix boxes, assuming that one system will always have the master copy and the other system will always have a recent copy (less than 48 hours old)?

Also, what would one have to do to scale that approach to handle dozens of machines getting a push of those changes?

6 Answers

Steve Schnepp · Answer 1 · 2009-05-05T07:01:24+08:00

Best Answer

If:

The modification times of your files are accurate
The files are not very large
No push can be missed (or there is some kind of backlog processing)

then you can use find -ctime or find -cnewer to build a list of the files changed since the last run, and copy over only those files (just a glorified differential push).

This translates quite nicely to multiple hosts: just build a differential tar on the source, and untar it on all the hosts.

It gives you something like this:

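# List files changed since the previous archive was written (the archive must exist before the first run).
find . -type f -cnewer /tmp/files_to_send.tar.gz > /tmp/files_to_send.txt
# Pack only those files into a fresh compressed archive.
tar zcf /tmp/files_to_send.tar.gz --files-from /tmp/files_to_send.txt
for HOST in host1 host2 host3 ...
do
    # The archive is gzip-compressed, so the remote tar needs z to read it from a pipe.
    ssh "$HOST" "tar xzpf -" < /tmp/files_to_send.tar.gz
done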

The script has to be refined, but you get the idea.
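
One possible refinement, sketched here under assumptions that are not part of the answer above: keep a separate stamp file instead of reusing the previous archive as the time reference, and advance it only after the new archive has been built, so a failed run does not lose changes. The function names and the stamp-file layout are hypothetical, and the sketch uses find -newer (modification time) in place of -cnewer. File names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch only: function names and the stamp-file layout are assumptions.

# list_changed SRC_DIR STAMP
# Print regular files under SRC_DIR modified after STAMP was last touched.
# If STAMP does not exist yet (first run), print every regular file.
list_changed() {
    if [ -e "$2" ]; then
        find "$1" -type f -newer "$2"
    else
        find "$1" -type f
    fi
}

# make_delta SRC_DIR STAMP ARCHIVE
# Build a gzip-compressed tar of the changed files, then advance STAMP.
make_delta() {
    touch "$2.next" || return 1            # record the scan start before scanning
    list_changed "$1" "$2" > "$3.list" || return 1
    tar -czf "$3" -T "$3.list" || return 1
    mv "$2.next" "$2"                      # advance only after a good archive
}
```

For example, running make_delta . /var/tmp/push.stamp /tmp/files_to_send.tar.gz from inside the source directory could replace the find and tar steps before the ssh loop; those paths are placeholders. A file modified while the scan is running is picked up again on the next run, which is harmless for a push like this.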

6

pjz · Answer 2 · 2009-05-05T06:50:20+08:00

Presuming that the data you're rsyncing isn't already compressed, turning on compression (-z) will likely help transfer speed, at the cost of some CPU on either end.
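
A minimal sketch of what that might look like. The paths, the host name, the suffix list and the DRY_RUN switch are assumptions for illustration. The --skip-compress option (rsync 3.0 and later) tells rsync not to spend CPU on files that are already compressed; note that passing it replaces rsync's built-in suffix list.

```shell
#!/bin/sh
# Sketch only: paths, host and suffix list are placeholders.

# run CMD...: execute CMD, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# mirror_compressed SRC DEST
# Archive-mode copy over ssh with compression on the wire, skipping
# compression for suffixes that are already compressed.
mirror_compressed() {
    run rsync -az --skip-compress=gz/bz2/zip/jpg -e ssh "$1" "$2"
}
```

Called as mirror_compressed /master/data/ mirror1:/backup/data/ it runs the transfer; with DRY_RUN=1 set it only prints the command it would run.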

4

mogsie · Answer 3 · 2012-06-01T15:18:27+08:00

rsync has a way of doing disconnected copies. In other words, rsync can (conceptually) diff a directory tree and produce a patch file, which you can later apply to any number of targets that are identical to the original destination.

This requires invoking rsync on the master and a mirror with --write-batch, which produces a file. You then transfer this file to any number of other targets and apply the batch to each of them using --read-batch.

If you keep a local copy of the last rsynced state (i.e. a copy of what the mirrors look like right now) on the same machine as the master, you can generate this "patch" on the master without even contacting any mirror:

On the master:

rsync -a --write-batch=my-batch.rsync /master/data /current/mirror

Add whatever other options you want. This will do two things:

It will make /current/mirror change to reflect /master/data
It will create a binary patch file (or batch file) called my-batch.rsync for later use.

Transfer the my-batch.rsync file from the master to all of your mirrors, and then apply the patch, so to speak, on each mirror:

rsync --read-batch=my-batch.rsync /local/mirror

Benefits of this approach:

master is not swamped
no need to coordinate/have access to the master / mirror(s) at the same time
different people with different privileges can do the work on the master and mirror(s).
no need to have a TCP channel (ssh, netcat, whatever; the file can be sent via e-mail ;-) )
offline mirrors can be synced later (just bring them on-line and apply the patch)
all mirrors guaranteed to be identical (since they apply the same "patch")
all mirrors can be updated simultaneously (since the --read-batch is only cpu/io intensive on the mirror itself)
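
The transfer-and-apply step could be scripted roughly as follows. This is a sketch: the host names, the /tmp staging path on the mirrors and the DRY_RUN switch are assumptions made for illustration, and paths are assumed to contain no spaces.

```shell
#!/bin/sh
# Sketch only: hosts, staging path and DRY_RUN are placeholders.

# run CMD...: execute CMD, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# distribute_batch BATCH_FILE MIRROR_DIR HOST...
# Copy the batch file to each host and apply it there. Hosts that cannot be
# reached are reported and can be brought up to date later.
distribute_batch() {
    batch=$1
    mirror_dir=$2
    shift 2
    remote="/tmp/$(basename "$batch")"
    failed=
    for host in "$@"; do
        run scp "$batch" "$host:$remote" &&
            run ssh "$host" "rsync --read-batch=$remote $mirror_dir" ||
            failed="$failed $host"
    done
    if [ -n "$failed" ]; then
        echo "not updated:$failed" >&2
        return 1
    fi
}
```

For instance, distribute_batch my-batch.rsync /local/mirror host1 host2 would update two mirrors in turn. Because applying the batch only loads the mirror itself, the loop body could also be run in the background for each host.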

Rodney Amato · Answer 4 · 2009-05-06T01:33:22+08:00

When you use rsync as a backup method, the biggest problem you will run into is a very large number of files. Rsync handles large files without a problem, but if the number of files you are backing up gets too large, rsync won't complete in a reasonable amount of time. If this happens, you will need to break the backup down into smaller parts and loop over those parts, e.g.

find /home -mindepth 1 -maxdepth 1 -print0 | xargs -0 -I {} rsync -a -e ssh {} backup@mybackupserver:/backup/

or tar up the file set to reduce the number of files.

As for having dozens of machines getting a mirror of those changes, it depends on how fresh the backup needs to be. One approach is to mirror the changes from the primary server to the backup server, and then have the other servers pull their changes from the backup server, for example from an rsync daemon running there. To avoid overwhelming that backup server, either schedule the other servers to pull at slightly different times, or have a script use passwordless ssh to connect to each server in turn and tell it to pull a fresh copy. Whether you go to that much trouble depends on how many machines are pulling a copy of the backup.
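
One way to get "slightly different times" without maintaining a schedule per machine is to derive a delay from each machine's name. The sketch below is an assumption about how that could be done, not something described in the answer; the function name, the 30-minute window and the rsync module in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch only: names and the window size are placeholders.

# stagger_delay NAME WINDOW
# Print a delay in seconds, between 0 and WINDOW-1, that is always the same
# for a given NAME, so each mirror pulls at its own fixed offset.
stagger_delay() {
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( crc % $2 ))
}

# Possible use from a cron job on each mirror (module name is hypothetical):
#   sleep "$(stagger_delay "$(hostname)" 1800)"
#   rsync -a rsync://mybackupserver/backup/ /backup/
```

The offsets are not guaranteed to be evenly spread, but with dozens of machines and a window of several minutes, collisions become unlikely enough to take most of the load spike off the backup server.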

gbjbaanb · Answer 5 · 2009-06-02T08:48:52+08:00

If you're transferring very large files with lots of changes, use the --inplace and --whole-file options. I use these for my 2 GB VM images and it helped a lot (mainly because the rsync protocol wasn't gaining much from sending incremental data for these files). I don't recommend these options for most cases, though.

Use --stats to see how well your files are being transferred by the rsync incremental protocol.
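
As a rough aid for reading that output, the sketch below turns the "Literal data" and "Matched data" lines that --stats prints into a single percentage. The function name is hypothetical, and it assumes rsync was run without -h so the byte counts are plain digits (optionally with thousands separators).

```shell
#!/bin/sh
# Sketch only: the function name is a placeholder.

# matched_pct: read `rsync --stats` output on stdin and print the share of
# file data (as a whole percentage) that was reused from the receiving side
# instead of being sent over the wire.
matched_pct() {
    awk -F': ' '
        /^Literal data:/ { gsub(/[^0-9]/, "", $2); lit = $2 }
        /^Matched data:/ { gsub(/[^0-9]/, "", $2); mat = $2 }
        END {
            if (lit + mat > 0) printf "%d\n", 100 * mat / (lit + mat)
            else print 0
        }'
}

# Possible use (paths and host are hypothetical):
#   rsync -a --stats /vm/images/ mirror1:/vm/images/ | matched_pct
```

A value near zero suggests the delta algorithm is finding little to reuse, which is the situation where --whole-file is worth trying.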

2

Jackalheart · Answer 6 · 2009-06-02T09:00:31+08:00

Another strategy is to make ssh and rsync faster. If you are going over a trusted (read: private) network, then encrypting the actual payload is not necessary. You can use HPN-SSH, which can be set up to encrypt only the authentication. Also, rsync version 3 starts transferring files while it is still building the file list, which is a huge time saving over rsync version 2. I don't know if that's what you were looking for, but I hope it helps. Also, rsync does support multicasting in some way, though I will not pretend to understand how.
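
To check whether a machine already has rsync 3, something like the following sketch may help. The function names are hypothetical, and the parsing assumes the first line of rsync --version has the usual "rsync  version X.Y.Z  protocol version N" shape. Both ends of the transfer need version 3 for the file-list improvement to apply.

```shell
#!/bin/sh
# Sketch only: function names are placeholders.

# rsync_major: read `rsync --version` output on stdin, print the major version.
rsync_major() {
    sed -n '1s/^rsync  *version v\{0,1\}\([0-9][0-9]*\)\..*/\1/p'
}

# has_rsync3: succeed if the version text on stdin reports rsync 3 or later.
has_rsync3() {
    major=$(rsync_major)
    [ "${major:-0}" -ge 3 ]
}

# Possible use on each end of the transfer:
#   rsync --version | has_rsync3 || echo "rsync 2.x here, consider upgrading"
```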

2
