We have a storage server currently holding about 20 TB of media files, which we want to synchronize to a second storage server for backup and fail-over. The facts are:
- we currently store about 9,000,000 files
- file sizes range from several KB up to 1 GB
- only one-way synchronization is required
- the files never get updated and there are no deletes -- only new files to synchronize
- the storage servers run Open-E and are mounted as NFS volumes on the network
Currently we just use plain rsync on a third server to perform the synchronization.
I would like to know whether there are better tools, commercial or open-source, for this number of files.
Thanks very much,
You might see an increase in performance if you just used a hand-rolled script that checked file creation time (and possibly size) and compared it to a list (or registry) of files that have already been synced to the backup server.
Rsync might be spending a lot of time checking for changes in ALL of your files when a check of one or two file attributes might be enough.
We do something similar, but on a much smaller scale, to synchronize photos between two servers. I wrote a bash script that maintains a sorted registry of file names concatenated with their creation times and sizes. Every time the script runs, it checks the server we sync from (the source server) and generates a sorted list of its files in the same format. I then use the comm command to compare the two registries and print only the entries that appear solely in the source server's listing. That is the list of files that must be synced to the target server.
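A minimal sketch of how such a listing could be generated, assuming GNU find and a hypothetical /mnt/source/photos mount point; the space-separated "name mtime size" format is just one possibility, and modification time stands in for creation time here:

    # Build a sorted "name mtime size" registry of every file under the source mount.
    ( cd /mnt/source/photos && find . -type f -printf '%P %TY-%Tm-%Td_%TH:%TM %s\n' | sort ) > current_registry.txt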
Then I just scp the new files over. I have some trapping, locking and throttling in there so that it doesn't overwhelm stuff, but it works and is pretty quick.
The nice thing is that you don't have to sync everything to start if you already have a lot of the files in both places. Just create an initial registry on the target server, then cron up the script and it will start syncing from that point. If you end up needing to sync a file that you never thought you would, all you have to do is touch it (to change the date info) on the source server and it will sync on the next scheduled run.
So for a directory that looks like this:
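(the names, times and sizes here are made up for illustration)

    $ ls -l /mnt/source/photos
    -rw-r--r-- 1 www-data www-data 1301945 Jan  7 10:31 1.JPG
    -rw-r--r-- 1 www-data www-data  921204 Jan  7 10:32 2.JPG
    -rw-r--r-- 1 www-data www-data 1512889 Jan  7 10:35 3.JPG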
This listing gets transformed by the script into a registry file that looks like this:
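(same made-up values, in the "name mtime size" format from the sketch above)

    1.JPG 2010-01-07_10:31 1301945
    2.JPG 2010-01-07_10:32 921204
    3.JPG 2010-01-07_10:35 1512889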
I store that registry (of the source server's files) on the target server. Every time the cron job runs on the target server, I create a list of the files currently on the source server using the same format. Let's say that a couple of new files, 10.JPG and 11.JPG, have appeared in the listing.
The current file registry will look like this:
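(still using the made-up values; note the list stays lexically sorted, which is what comm needs)

    1.JPG 2010-01-07_10:31 1301945
    10.JPG 2010-01-08_09:01 1200331
    11.JPG 2010-01-08_09:02 998217
    2.JPG 2010-01-07_10:32 921204
    3.JPG 2010-01-07_10:35 1512889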
Running comm against the old registry and the current registry and cutting out the first field (the file that needs to be copied) like so:
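(the registry file names here are placeholders)

    comm -23 current_registry.txt old_registry.txt | cut -d' ' -f1 > files_to_copy.txt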
Will yield a list of files (one per line) that needs to be copied (I use scp) to the backup (target) server:
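In the running example:

    $ cat files_to_copy.txt
    10.JPG
    11.JPG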
Then you just process that list of files through a loop.
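A sketch of that loop (the paths and host name are placeholders, and file names containing spaces would need extra care):

    # Copy each new file to the backup server.
    while read -r f; do
        scp "/mnt/source/photos/$f" backup-server:/mnt/target/photos/
    done < files_to_copy.txt
    # Promote the current listing so the next run only sees files added after this point.
    mv current_registry.txt old_registry.txt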
The comm command above is basically saying "show me everything that ONLY exists in the first file." The comparisons it makes are also very fast -- it's just comparing lines in a text file, after all, even if that file is very large. Luckily, you've populated that text file with some basic meta-data about your files and, through comm, are comparing that data very quickly.
The nice thing about stuffing that meta-data into the list is that it handles situations where a file has changed between syncs. Say a new version of a file comes along, or there was a problem with the old one. The file's name will exist in the old registry, but its meta-data (creation time stamp and size) will be different. So the current file registry will show that difference, and the comm comparison will report that the entry only exists in the first file. When you create the list of files to copy, that file name will be in there, and your copy command will overwrite the out-of-date file of the same name.
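For instance, if 3.JPG from the made-up example were replaced with a corrected version, the two registries might contain:

    old registry:      3.JPG 2010-01-07_10:35 1512889
    current registry:  3.JPG 2010-01-09_14:20 1622001

The second line exists only in the current registry, so comm emits it and 3.JPG is copied again, overwriting the stale copy on the target.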
The rest (the trapping, locking and throttling mentioned above) is just details.
Hope that helps. This works quite well for our situation but, as with all things, it might not fit the constraints of your organization or setup. Good luck -- at the very least it might give you some ideas.
Here are a few options to look into.
Look into DRBD if you don't need to access both copies at the same time. That restriction is due to file-system limitations, not limitations of DRBD, and there are a few workarounds if you do need to access the secondary copy. The project was also recently accepted into the mainline kernel, so support going forward should be fairly straightforward.
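A minimal sketch of what a DRBD resource definition can look like (host names, devices and addresses are placeholders, and a real setup also needs metadata initialization on both nodes):

    # e.g. /etc/drbd.d/media.res -- replicate one block device from storage1 to storage2
    resource media {
      protocol C;                  # synchronous replication
      device    /dev/drbd0;        # the replicated device the file system sits on
      disk      /dev/sdb1;         # backing block device on each node
      meta-disk internal;
      on storage1 {
        address 192.168.0.10:7788;
      }
      on storage2 {
        address 192.168.0.20:7788;
      }
    }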
Another option would be a distributed file system such as GlusterFS, which can be set up in a two-node replicated configuration. I think this would be ideal, as it should allow for better failover and scalability. MongoDB also looks interesting for this sort of thing with its GridFS, but it's a bit newer.
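With a recent GlusterFS release, a two-node replicated volume can be created roughly like this (host names and brick paths are placeholders; glusterd must already be running on both servers):

    # Run on storage1.
    gluster peer probe storage2
    gluster volume create media replica 2 storage1:/export/media storage2:/export/media
    gluster volume start media
    # Clients (or the servers themselves) then mount the volume:
    mount -t glusterfs storage1:/media /mnt/media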