I want to replicate roughly 10 TB of data (lots of smallish files, low level of churn) across a WAN with minimal impact on the available infrastructure.
While I could simply use rsync, that means scanning for changes and comparing the local and remote data, which costs disk I/O, network bandwidth and CPU. rsync does this efficiently, but I wonder if there is a more efficient solution which can track changes and propagate them (preferably bidirectionally).
The storage itself is iSCSI on HP NAS devices. We have looked previously at using its built-in replication capabilities but found them to be slow and unreliable.
DRBD mirrors would require additional hardware at both ends, which would be rather expensive. I've also been bitten by DRBD replication failures in the past.
Would glusterfs be more efficient? Would it be really dumb to go with a 2 node setup? Is there a better solution?
At the block level, synchronization can be done with StarWind, which creates a mirrored disk on both ends. It can run on top of iSCSI LUNs, giving active-active storage, and no additional hardware is required. https://www.starwindsoftware.com/blog/storage-ha-on-the-cheap-fixing-synology-diskstation-flaky-performance-with-starwind-free-part-3-failover-duration
At the file level, lsyncd and rsync can mirror files between servers. These tools might need some configuration tweaking to ensure that file locking works as expected and that no split-brain occurs. https://linoxide.com/tools/setup-lsyncd-sync-directories/
You could use lsyncd to keep files constantly in sync between systems. lsyncd installs inotify watches on the directories that are synced. Whenever files change in those directories, it transfers the changes to the remote server using rsync.
You could use ionice to limit the I/O load and rsync's bwlimit argument to limit network I/O. There are also some other methods: Rsync huge dataset of small files 5TB, +M small files
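As a rough illustration of the ionice plus bwlimit suggestion, here is a minimal Python sketch that assembles such a command line. The function name, the paths, the host name and the default values (5000 KiB/s, best-effort class at lowest priority) are placeholders chosen for this example, not recommendations; tune them for your own link and disks. Note that ionice priorities only have an effect when the I/O scheduler in use honours them.

```python
import shlex


def build_rsync_command(src, dest, bwlimit_kbps=5000, io_class=2, io_priority=7):
    """Return an argv list that runs rsync under ionice with a bandwidth cap.

    All defaults are illustrative assumptions:
      io_class=2, io_priority=7 -> best-effort class, lowest priority
      bwlimit_kbps              -> passed to rsync's --bwlimit option
    """
    return [
        "ionice", "-c", str(io_class), "-n", str(io_priority),
        "rsync", "-a",
        f"--bwlimit={bwlimit_kbps}",
        src, dest,
    ]


if __name__ == "__main__":
    # Hypothetical source directory and remote target.
    cmd = build_rsync_command("/srv/data/", "backup@remote.example:/srv/data/")
    # Print the command for review; run it with subprocess.run(cmd, check=True)
    # once the paths and limits look right.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.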
If you are willing to try something new, then IPFS might be a great tool for you to experiment with.
https://ipfs.io/
Using a private IPFS Cluster might give you great results, depending on your file replication needs.
https://cluster.ipfs.io/
However, bear in mind that this is pretty new stuff, although it is maturing very quickly.