We have a storage server currently holding about 20 TB of media files, which we want to synchronize to a second storage server for backup and fail-over. The facts are:
- we currently store about 9,000,000 files
- file sizes range from several KB up to 1 GB
- only one-way synchronization is required
- the files never get updated and there are no deletes -- only new files to synchronize
- the storage servers run Open-E and are mounted as NFS volumes on the network
Currently we just use plain rsync on a third server to perform the synchronization.
I would like to know whether there are better tools, commercial or open-source, for this number of files.
Thanks very much,
You might see an increase in performance if you just used a hand-rolled script that checked file creation time (and possibly size) and compared it to a list (or registry) of files that have already been synced to the backup server.
Rsync might be spending a lot of time checking for changes in ALL of your files when a check of one or two file attributes might be enough.
We do something similar, but on a much smaller scale, to synchronize photos between two servers. I wrote a bash script that maintains a sorted registry of file names concatenated with their creation times and sizes. Every time the script runs, it checks the server we sync from (the source server) and generates a sorted list of its files in the same format. I then use the comm command to compare the two registries and print only the entries that appear solely in the source server's listing. That is the list of files that must be synced to the target server.
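A minimal sketch of how such a listing could be generated, assuming GNU find and a hypothetical /mnt/source/photos mount point; the space-separated "name mtime size" format is just one possibility, and modification time stands in for creation time here:

    # Build a sorted "name mtime size" registry of every file under the source mount.
    ( cd /mnt/source/photos && find . -type f -printf '%P %TY-%Tm-%Td_%TH:%TM %s\n' | sort ) > current_registry.txt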
Then I just scp the new files over. I have some trapping, locking and throttling in there so that it doesn't overwhelm stuff, but it works and is pretty quick.
The nice thing is that you don't have to sync everything to start if you already have a lot of the files in both places. Just create an initial registry on the target server, then cron up the script and it will start syncing from that point. If you end up needing to sync a file that you never thought you would, all you have to do is touch it (to change the date info) on the source server and it will sync on the next scheduled run.
So for a directory that looks like this:
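(the names, times and sizes here are made up for illustration)

    $ ls -l /mnt/source/photos
    -rw-r--r-- 1 www-data www-data 1301945 Jan  7 10:31 1.JPG
    -rw-r--r-- 1 www-data www-data  921204 Jan  7 10:32 2.JPG
    -rw-r--r-- 1 www-data www-data 1512889 Jan  7 10:35 3.JPG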
This listing gets transformed by the script into a registry file that looks like this:
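(same made-up values, in the "name mtime size" format from the sketch above)

    1.JPG 2010-01-07_10:31 1301945
    2.JPG 2010-01-07_10:32 921204
    3.JPG 2010-01-07_10:35 1512889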
I store that registry (of the source server's files) on the target server. Every time the cron job runs on the target server, I create a list of the files currently on the source server using the same format. Let's say that a couple of new files, 10.JPG and 11.JPG, have appeared in the listing.
The current file registry will look like this:
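(still using the made-up values; note the list stays lexically sorted, which is what comm needs)

    1.JPG 2010-01-07_10:31 1301945
    10.JPG 2010-01-08_09:01 1200331
    11.JPG 2010-01-08_09:02 998217
    2.JPG 2010-01-07_10:32 921204
    3.JPG 2010-01-07_10:35 1512889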
Running comm against the old registry and the current registry and cutting out the first field (the file that needs to be copied) like so:
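(the registry file names here are placeholders)

    comm -23 current_registry.txt old_registry.txt | cut -d' ' -f1 > files_to_copy.txt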
Will yield a list of files (one per line) that needs to be copied (I use scp) to the backup (target) server:
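In the running example:

    $ cat files_to_copy.txt
    10.JPG
    11.JPG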
Then you just process that list of files through a loop.
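A sketch of that loop (the paths and host name are placeholders, and file names containing spaces would need extra care):

    # Copy each new file to the backup server.
    while read -r f; do
        scp "/mnt/source/photos/$f" backup-server:/mnt/target/photos/
    done < files_to_copy.txt
    # Promote the current listing so the next run only sees files added after this point.
    mv current_registry.txt old_registry.txt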
The comm command above is basically saying "show me everything that ONLY exists in the first file." The comparisons it makes are also very fast -- it's just comparing lines in a text file, after all, even if that file is very large. Luckily, you've populated that text file with some basic meta-data about your files and, through comm, are comparing that data very quickly.
The nice thing about stuffing that meta-data into the list is that it handles situations where a file has changed between syncs. Say a new version of a file comes along, or there was a problem with the old one. The file's name will exist in the old registry, but its meta-data (creation time stamp and size) will be different. So the current file registry will show that difference, and the comm comparison will report that the entry only exists in the first file. When you create the list of files to copy, that file name will be in there, and your copy command will overwrite the out-of-date file of the same name.
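For instance, if 3.JPG from the made-up example were replaced with a corrected version, the two registries might contain:

    old registry:      3.JPG 2010-01-07_10:35 1512889
    current registry:  3.JPG 2010-01-09_14:20 1622001

The second line exists only in the current registry, so comm emits it and 3.JPG is copied again, overwriting the stale copy on the target.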
The rest (the trapping, locking and throttling mentioned above) is just details.
Hope that helps. This works quite well for our situation but, as with all things, it might not fit the constraints of your organization or setup. Good luck -- at the very least it might give you some ideas.
Here are a few options to look into.
Look into DRBD if you don't need to access both copies at the same time. That restriction is due to file-system limitations, not limitations of DRBD, and there are a few workarounds if you do need to access the secondary copy. The project was also recently accepted into the mainline kernel, so support going forward should be fairly straightforward.
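A minimal sketch of what a DRBD resource definition can look like (host names, devices and addresses are placeholders, and a real setup also needs metadata initialization on both nodes):

    # e.g. /etc/drbd.d/media.res -- replicate one block device from storage1 to storage2
    resource media {
      protocol C;                  # synchronous replication
      device    /dev/drbd0;        # the replicated device the file system sits on
      disk      /dev/sdb1;         # backing block device on each node
      meta-disk internal;
      on storage1 {
        address 192.168.0.10:7788;
      }
      on storage2 {
        address 192.168.0.20:7788;
      }
    }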
Another option would be a distributed file system such as GlusterFS, which can be set up in a two-node replicated configuration. I think this would be ideal, as it should allow for better failover and scalability. MongoDB also looks interesting for this sort of thing with its GridFS, but it's a bit newer.
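With a recent GlusterFS release, a two-node replicated volume can be created roughly like this (host names and brick paths are placeholders; glusterd must already be running on both servers):

    # Run on storage1.
    gluster peer probe storage2
    gluster volume create media replica 2 storage1:/export/media storage2:/export/media
    gluster volume start media
    # Clients (or the servers themselves) then mount the volume:
    mount -t glusterfs storage1:/media /mnt/media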