At a company I work for, we have things called "playlists", which are small files of roughly 100-300 bytes each. There are about a million of them, and about 100,000 of them change every hour. These playlists need to be uploaded to 10 remote servers on different continents every hour, and it needs to happen quickly, ideally in under 2 minutes. It's very important that files deleted on the master are also deleted on all the replicas. We currently use Linux for our infrastructure.
I was thinking about trying rsync with the -W option to copy whole files without comparing contents. I haven't tried it yet, but maybe people with more rsync experience could tell me whether it's a viable option?
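Roughly what I had in mind (untested; the paths and host name below are just placeholders):

    # -a preserves attributes and recurses, -W sends whole files without the
    # delta algorithm, --delete removes files on the replica that were deleted
    # on the master
    rsync -aW --delete /data/playlists/ replica1.example.com:/data/playlists/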
What other options are worth considering?
Update: I have chosen the lsyncd answer, but only because it was the most popular; the other suggested alternatives are also valid in their own way.
Since instant updates are also acceptable, you could use lsyncd.
It watches directories (via inotify) and rsyncs changes to the slaves. At startup it does a full rsync, so that will take some time, but after that only changes are transmitted. Recursive watching of directories is possible, and if a slave server is down, the sync is retried until it comes back.
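As a rough, untested sketch (host names and paths are placeholders), the command-line rsync-over-ssh mode looks like this; for several slaves you would start one instance per target, or, more tidily, use lsyncd's Lua configuration file with one sync block per target:

    # one lsyncd daemon per slave; each watches /data/playlists via inotify
    # and replays changes with rsync over ssh
    for host in replica1.example.com replica2.example.com; do
        lsyncd -rsyncssh /data/playlists/ "$host" /data/playlists/
    done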
If this is all in a single directory (or a static list of directories), you could also use incron.
The drawback there is that it does not allow recursive watching of directories, and you need to implement the sync functionality yourself.
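For illustration only (the script name, hosts, and paths are made up, and this is untested), the wrapper called by incron could look something like this:

    #!/bin/bash
    # invoked by an incrontab entry along the lines of:
    #   /data/playlists IN_CLOSE_WRITE,IN_DELETE,IN_MOVED_TO /usr/local/bin/push-playlist.sh $@/$#
    # $1 is the full path of the file that changed
    file="$1"
    for host in replica1.example.com replica2.example.com; do
        if [ -e "$file" ]; then
            # copy the changed file, recreating its path on the replica
            rsync -aR "$file" "$host":/ &
        else
            # the file was deleted on the master, so delete it on the replica too
            ssh "$host" rm -f "$file" &
        fi
    done
    wait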
Consider using a distributed filesystem, such as GlusterFS. Being designed with replication and parallelism in mind, GlusterFS may scale up to 10 servers much more smoothly than ad-hoc solutions involving inotify and rsync.
For this particular use case, one could build a 10-server GlusterFS volume of 10 replicas (i.e. 1 replica/brick per server), so that each replica would be an exact mirror of every other replica in the volume. GlusterFS would automatically propagate filesystem updates to all replicas.
Clients in each location would contact their local server, so read access to files would be fast. The key question is whether write latency could be kept acceptably low. The only way to answer that is to try it.
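As a rough illustration (volume name, server names, and brick paths are placeholders), creating such a fully replicated volume would look roughly like this:

    # one brick per server, replica count equal to the number of servers,
    # so every server ends up holding a complete copy of the volume
    gluster volume create playlists replica 10 \
        server1:/bricks/playlists server2:/bricks/playlists server3:/bricks/playlists \
        server4:/bricks/playlists server5:/bricks/playlists server6:/bricks/playlists \
        server7:/bricks/playlists server8:/bricks/playlists server9:/bricks/playlists \
        server10:/bricks/playlists
    gluster volume start playlists

    # on each server, mount the volume locally through the GlusterFS client
    mount -t glusterfs localhost:/playlists /data/playlists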
I doubt rsync would work for this in the normal way, because scanning a million files and comparing them to the remote system 10 times would take too long. I would try to implement a system with something like inotify that keeps a list of modified files and pushes them to the remote servers (unless these changes already get logged some other way). You can then use this list to quickly identify the files that need to be transferred, maybe even with rsync (or, better, 10 parallel instances of it).
Edit: with a little bit of work, you could even use this inotify/log-watch approach to copy files over as soon as a modification happens.
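A minimal sketch of that idea, assuming inotify-tools is installed and rsync is new enough to support --delete-missing-args (paths and hosts are placeholders, untested):

    SRC=/data/playlists
    LIST=/tmp/changed-playlists.txt

    # record every created, modified, moved-in or deleted file, relative to $SRC
    inotifywait -m -r -e close_write -e delete -e moved_to --format '%w%f' "$SRC" |
    while read -r path; do
        echo "${path#$SRC/}" >> "$LIST"
    done &

    # once an hour, push the accumulated batch to all replicas in parallel;
    # --delete-missing-args turns listed files that no longer exist on the
    # master into deletions on the replica
    while sleep 3600; do
        sort -u "$LIST" > "$LIST.batch" && : > "$LIST"
        for host in replica1.example.com replica2.example.com; do
            rsync -aW --files-from="$LIST.batch" --delete-missing-args \
                  "$SRC"/ "$host":/data/playlists/ &
        done
        wait
    done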
Some more alternatives:
This seems like a textbook use case for MongoDB, and maybe GridFS. Since the files are relatively small, MongoDB alone should be enough, although it may be convenient to use the GridFS API.
MongoDB is a NoSQL database, and GridFS is a file store built on top of it. MongoDB has a lot of built-in options for replication and sharding, so it should scale very well in your use case.
In your case you would probably start with a replica set consisting of the master located in your primary datacenter (and maybe a second one there, in case you want to fail over within the same location) and your ten "slaves" distributed around the world. Then do load tests to check whether the write performance is sufficient, and check the replication times to your nodes. If you need more performance, you could turn the setup into a sharded one (mostly to distribute the write load across more servers). MongoDB has been designed to scale huge setups on "cheap" hardware, so you can throw in a batch of inexpensive servers to improve performance.
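A hypothetical sketch of that first step (host names, ports, and priorities are invented), initiating the replica set from the mongo shell:

    # each mongod must have been started with --replSet playlists for this to work;
    # priority 0 keeps the remote members from ever being elected primary, so all
    # writes stay in the primary datacenter and replicate outwards
    mongo --host master.example.com --eval '
    rs.initiate({
      _id: "playlists",
      members: [
        { _id: 0, host: "master.example.com:27017",     priority: 2 },
        { _id: 1, host: "standby.example.com:27017",    priority: 1 },
        { _id: 2, host: "eu-replica.example.com:27017", priority: 0 },
        { _id: 3, host: "ap-replica.example.com:27017", priority: 0 }
      ]
    })'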
I would use an S3 backend and simply mount the bucket on all the servers that need it. That way, everyone is in sync instantly anyway.
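For example with s3fs-fuse (bucket name, mount point, and credentials file are placeholders; untested), run on every server:

    # every read and write becomes an S3 API call, so latency is worth testing
    s3fs playlists-bucket /data/playlists -o passwd_file=/etc/passwd-s3fs -o allow_other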
An option that doesn't appear to have been mentioned yet is to archive all the files into one compressed file. This should reduce the total size significantly and remove all the overhead you get from dealing with millions of individual files. By replacing the entire set of files in one big update you can also rest assured that removed files are removed on the replicas.
The downside, of course, is that you are transferring many files unnecessarily. That may or may not be balanced out by the reduced size thanks to compression. Also, I have no idea how long it would take to compress that many files.
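An untested sketch of that approach (paths and host names are placeholders); unpacking into a fresh directory and swapping it into place means files deleted on the master disappear on the replica as well:

    # pack the whole tree once on the master
    tar -C /data -czf /tmp/playlists.tar.gz playlists

    # push the single archive to each replica and swap it into place
    for host in replica1.example.com replica2.example.com; do
        scp /tmp/playlists.tar.gz "$host":/tmp/
        ssh "$host" 'rm -rf /data/playlists.new &&
                     mkdir /data/playlists.new &&
                     tar -C /data/playlists.new --strip-components=1 -xzf /tmp/playlists.tar.gz &&
                     rm -rf /data/playlists.old &&
                     if [ -d /data/playlists ]; then mv /data/playlists /data/playlists.old; fi &&
                     mv /data/playlists.new /data/playlists'
    done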