We wish to install some servers in a remote datacentre to act as a backup storage location for our primary datacentre.
Assuming that both sites will have GigE connectivity, what is the best method to use for fast file transfer? I love rsync, however since we have a lot of data to transfer (1.5TB per night), I think that the SSH protocol used in rsync may slow things down a lot :(
We could install some fast VPN endpoints to cater for link encryption, however the question still stands: what is the best tool for the actual transfer?
Backup performance is determined by many factors, bandwidth being only one of them.
It is often limited by the write performance of the storage.
A good option is to run rsync in daemon mode on the backup server, which avoids ssh altogether. However, unless your processors are really slow, the ssh overhead should not be significant.
To run rsync as a daemon, start the rsync daemon on the backup server.
By default it listens on TCP port 873; you can change this in rsyncd.conf.
Then run rsync on the client against the daemon's address instead of going through ssh.
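As a rough illustration of the daemon setup described above, here is a minimal Python sketch that only builds the configuration text and the command lines; it does not run anything. The module name `backup`, the path `/srv/backup`, and the host `backup.example.com` are made-up placeholders, not values from this thread.

```python
import shlex

# Hypothetical example values: module "backup" exported from /srv/backup.
RSYNCD_CONF = """\
# /etc/rsyncd.conf on the backup server
port = 873

[backup]
    path = /srv/backup
    read only = false
"""


def daemon_command(config_path="/etc/rsyncd.conf"):
    """Command line that starts rsync in daemon mode on the server."""
    return ["rsync", "--daemon", f"--config={config_path}"]


def client_command(source_dir, host, module, port=873):
    """Command line that pushes source_dir to a daemon module, bypassing ssh."""
    # The trailing slash makes rsync copy the directory's contents.
    source = f"{source_dir.rstrip('/')}/"
    return ["rsync", "-a", source, f"rsync://{host}:{port}/{module}/"]


if __name__ == "__main__":
    print(RSYNCD_CONF)
    print(shlex.join(daemon_command()))
    print(shlex.join(client_command("/data", "backup.example.com", "backup")))
```

Note that traffic to an rsync daemon is unencrypted, so this fits best behind the VPN endpoints mentioned in the question.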
There is not enough information to estimate the performance you can expect, but transferring 1.5 TB per day is certainly feasible.
During a backup, data writes are interleaved with a number of file system operations (queries and metadata updates). It is generally a good idea to run several rsync processes in parallel to hide the latency of file creation.
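One way to run several rsync processes side by side is sketched below in Python. It assumes the data can be split by top-level directory; the directory names, the destination, and the worker count are illustrative only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(source_dirs, destination):
    """Build one rsync command per source directory."""
    commands = []
    for directory in source_dirs:
        name = directory.strip("/").split("/")[-1]
        commands.append([
            "rsync", "-a",
            f"{directory.rstrip('/')}/",
            f"{destination.rstrip('/')}/{name}/",
        ])
    return commands


def run_rsync(command):
    """Run a single rsync command and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_parallel(commands, workers=4, runner=run_rsync):
    """Run the commands concurrently; exit codes come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))


# Example (hypothetical paths and host):
# cmds = build_commands(["/data/a", "/data/b"], "rsync://backup.example.com/backup")
# exit_codes = run_parallel(cmds, workers=2)
```

Threads are enough here because each worker just waits on an external rsync process. The right number of workers depends on the storage and is best found by measurement.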
You may want to look into file acceleration software. I think there are many players in this market, but the one I have seen used in the past was Aspera. Here is a page comparing Aspera Sync to rsync (comparison tables at the bottom of the page):
http://asperasoft.com/en/products/synchronization_23/aspera_sync_23
Also, make sure that neither side uses a really old version of rsync. There are still 2.x versions in use, and these make the whole chain fall back to an older and in some cases far less efficient version of the protocol. (If you are told "sending incremental file list", you are fine. If it is "sending file list", the 2.x protocol is being used.)
I think a delta of 1.5 TB per day is a bit beyond the typical size for solutions like rsync. SSH has an architectural cap at about 2-3 MB/s IIRC and, as written before, the native rsync daemon protocol is much faster but unencrypted.
You should really have a look at solutions which are specifically designed to synchronize these amounts of data. What I have worked with in the past are the Quantum DXi appliances, which are storage appliances that also offer deduplication and encrypted replication. You might want to have a look at these.
/edit: To extend my above statement a bit more, it is important to take the following things into consideration when measuring SSH speed:
The big advantage of deduplication here is that data is deduplicated at the block level. If you create one tar (not compressed!) per customer and put it on one of the DXi appliances at your main site, the appliance will automatically eliminate duplicate blocks in the file stream (e.g. if 100 customers have the same movie in their tar, it will be stored only once and referenced the other 99 times), and the blocks will also be compressed.
If you then add a second appliance off-site, only the unique data blocks are transferred to it. With that you could in fact perform daily full backups at your main site, and only the newly written unique blocks would have to be transferred over the WAN to the off-site appliance.
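To make the idea concrete, here is a toy Python sketch of block-level deduplication and replication. It is not how the DXi appliances are implemented: it assumes fixed 4096-byte blocks and SHA-256 digests purely for illustration, whereas real products may chunk data differently and also compress it.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for this sketch


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its digest."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def add(self, data):
        """Store data; return (recipe, digests of blocks new to this store)."""
        recipe, new_digests = [], []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_digests.append(digest)
            recipe.append(digest)
        return recipe, new_digests

    def restore(self, recipe):
        """Rebuild the original data from its list of block digests."""
        return b"".join(self.blocks[digest] for digest in recipe)


def replicate(source, target, digests):
    """Copy blocks the target lacks; return the number of bytes sent."""
    sent = 0
    for digest in digests:
        if digest not in target.blocks:
            target.blocks[digest] = source.blocks[digest]
            sent += len(source.blocks[digest])
    return sent
```

Adding the same payload 100 times to one store keeps each block once, and replicating after each add sends those blocks to the second store only the first time.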
Someone here mentioned using the rsync daemon. This is a much 'lighter' solution than tunneling the traffic over ssh, but even with ssh encapsulation, transferring 1.5TB overnight and saturating a gigabit link should be doable.
Assuming you have a few large files [possibly a wrong assumption], you should be able to transfer the payload within ~5h. I've done a quick test:
Telling ssh to use a lighter compression method:
So, assuming storage is not a bottleneck: 106MB/s ~= 380GB/h, which puts 1.5TB at about 4h, comfortably within the ~5h estimate.
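The arithmetic can be checked with a few lines of Python. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and ignores protocol overhead; the 125 MB/s figure is simply the raw line rate of gigabit Ethernet, added for comparison.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def gb_per_hour(bytes_per_second):
    """Convert a sustained rate in bytes/s to GB/h."""
    return bytes_per_second * 3600 / GB


def hours_to_transfer(total_bytes, bytes_per_second):
    """Hours needed to move total_bytes at a sustained rate."""
    return total_bytes / bytes_per_second / 3600


measured = 106 * MB   # rate quoted in the test above
line_rate = 125 * MB  # raw gigabit Ethernet, no overhead

print(f"{gb_per_hour(measured):.1f} GB/h at 106 MB/s")
print(f"{hours_to_transfer(1.5 * TB, measured):.2f} h for 1.5 TB at 106 MB/s")
print(f"{hours_to_transfer(1.5 * TB, line_rate):.2f} h for 1.5 TB at line rate")
```

At 106 MB/s this works out to just under four hours, so the ~5h figure leaves some headroom for slower phases of the transfer.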
Both tests were done on an idle machine with a Xeon E5430 @ 2.66GHz CPU.
To make things more efficient (making use of multiple cores if you have a slower CPU), or just to make better use of the available bandwidth and I/O, you can run a few parallel rsync sessions, each handling a few files.
I don't know whether you own/lease the fiber or just use an MPLS service provided by the operator. Regardless, ssh gives you the additional benefit of strong encryption without setting up a VPN in between.