I have a large set of data (100+ GB) which can be stored in files. Most of the files would be in the 5k-50k range (80%), then 50k-500k (15%) and >500k (5%). The maximum expected size of a file is 50 MB. If necessary, large files can be split into smaller pieces. Files can be organized in a directory structure too.
If some data must be modified, my application makes a copy, modifies it and, if successful, flags it as the latest version. Then the old version is removed. It is crash safe (so to speak).
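In shell terms, the update roughly looks like this (file names are made up, and modify stands in for whatever my application actually does):

    # Copy the current version aside, change the copy, then promote it.
    # The rename is atomic on the same filesystem, which is what makes
    # the sequence crash safe: readers see either v1 or v2, never a mix.
    cp data/item.v1 data/item.v2.tmp
    modify data/item.v2.tmp            # placeholder for the real work
    mv data/item.v2.tmp data/item.v2   # flag as latest version
    rm data/item.v1                    # remove the old version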
I need to implement a failover system to keep this data available. One solution is to use a Master-Slave database system, but these are fragile and force a dependency on the database technology.
I am no sysadmin, but I read about the rsync command. It looks very interesting. I am wondering if setting up some failover nodes and using rsync from my master is a sensible option. Has anyone tried this before successfully?
i) If yes, should I split my large files? Is rsync smart/efficient at detecting which files to copy/delete? Should I implement a specific directory structure to make this system efficient?
ii) If the master crashes and a slave takes over for an hour (for example), is making the master up-to-date again as simple as running rsync the other way round (slave to master)?
iii) Bonus question: Is there any possibility of implementing multi-master systems with rsync? Or is only master-slave possible?
I am looking for advice, tips, experience, etc. Thanks!
Rsync is extremely efficient at detecting and updating files. Depending on how your files change, you might find that a smaller number of large files is far easier to sync than lots of small files. Depending on what options you choose, on each run it is going to stat() every file on both sides, and then transfer the changes if the files differ. If only a small number of your files are changing, then this step of looking for changed files can be quite expensive. A lot of factors come into play in how long rsync takes. If you are serious about trying this, you should do a lot of testing on real data to see how things work.
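A cheap way to do that testing is a dry run with itemized output, which shows what rsync would transfer without moving any data (hostnames and paths below are placeholders):

    # -a preserves times/permissions, -n is dry-run, -i itemizes each
    # change; --delete also reports files that would be removed.
    rsync -ani --delete /data/ slave:/data/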
Should be.
Unison, which uses the rsync libraries, allows bi-directional sync. It should permit updates on either side. With the correct options it can identify conflicts and save backups of any file where a change was made on both ends.
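Something along these lines (paths made up) would run an unattended two-way sync over ssh and keep a backup copy of anything Unison is about to overwrite, which covers the changed-on-both-ends case:

    # Two-way sync between a local tree and a remote one.
    # -batch runs without prompting; -backup 'Name *' keeps a backup
    # of every file before it is replaced.
    unison /data ssh://slave//data -batch -backup 'Name *'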
Without knowing more about the specifics I can't tell you with any confidence this is the way to go. You may need to look at DRBD, or some other clustered device/filesystem approach which will sync things at a lower level.
Should I split my large files?
rsync is smart, but very large files can be dramatically less efficient to synchronize. Here's why:
If only a part of a file changes, then rsync is smart enough to send only that part. But to figure out which part to send, it has to divide the file into logical chunks of X bytes, build checksums for each chunk (on both sides), compare the chunks, send the differences, and then reconstruct the file on the receiving end.
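You can watch this trade-off directly: for local copies rsync skips the delta algorithm unless you force it, and the chunk size is tunable (the values here are arbitrary):

    # Force the delta algorithm even for a local destination, using
    # 64 KB chunks; -W / --whole-file would skip chunking entirely.
    rsync -av --no-whole-file --block-size=65536 bigfile.dat /mnt/replica/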
On the other hand, if you have a bunch of small files which don't change, then the dates and sizes will match and rsync will skip the checksum step and just assume that the file hasn't changed. If we're talking about many GB of data, you're skipping a LOT of IO, and saving a LOT of time. So even though there's extra overhead involved with comparing more files, it still comes out to less than the amount of time required to actually read the files and compare the checksums.
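The difference is easy to see on the command line: the default quick check skips files on metadata alone, while -c forces a full read and checksum of every file on both sides (paths are placeholders):

    # Default quick check: unchanged size + mtime means the file is skipped.
    rsync -av /data/ slave:/data/

    # --checksum: read and hash every file, even the unchanged ones.
    # Far slower on 100+ GB, but catches changes that kept the same mtime.
    rsync -avc /data/ slave:/data/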
So, while you want as few files as possible, you also want enough files so that you won't waste a lot of IO working on unchanged data. I'd recommend splitting the data along the logical boundaries your application uses.
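If a piece has no natural boundary and simply must come in under some size cap, the standard split tool can cut it by size (the 8 MB chunk size is arbitrary):

    # Cut a large file into 8 MB pieces: bigfile.dat.part-aa, -ab, ...
    split -b 8M bigfile.dat bigfile.dat.part-
    # Reassemble; the shell sorts the glob, so pieces concatenate in order.
    cat bigfile.dat.part-* > bigfile.dat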
is making the master up-to-date again as simple as running rsync the other way round
From a filesystem perspective, yes. But your application might have other requirements that complicate things. And, of course, you'll be reverting to the most recent checkpoint at which you rsync'ed to your slave.
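Mechanically the failback is just the mirror image of the normal sync (hostnames are placeholders; do a dry run with -n first, since --delete will remove anything the slave deleted while it was live):

    # Pull the slave's state back onto the recovered master.
    rsync -av --delete slave:/data/ /data/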
Is there any possibility of implementing multi-master systems with rsync?
Technically yes, but down that path lies madness. Assuming everything works great, then everything will be fine. But when there are hiccups, you can start to run into problems with changes (and specifically deletes) getting synced in the wrong direction, overwriting your good files with your bad ones, deleting your newly inserted files, or the ghosts of deleted files reappearing. Most people recommend against it, but you can try it if you like.
advice, tips, experience
If you're looking for a master/master setup with on-the-fly syncing, I'd recommend DRBD. It's significantly more complicated to set up and maintain, but a lot more capable. It does block-level synchronization of the disk itself, rather than of the files on it. To do this "on-line", you need a filesystem that can tolerate that type of synchronization, like GFS.
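For a rough sense of what's involved, bringing up a DRBD resource looks something like this; the resource name is illustrative and the exact commands vary between DRBD versions:

    # Initialize DRBD metadata on the backing device, bring the
    # resource up (run on both nodes), then promote one node so the
    # initial block-level sync can begin.
    drbdadm create-md r0
    drbdadm up r0
    drbdadm primary --force r0   # on one node only, for the first sync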
Rsync is more like a snapshot system than a continuous synchronization system.