We have a folder structure on our intranet containing around 800,000 files spread across roughly 4,000 folders. We need to synchronize this to a small cluster of machines in our DMZs. The structure is very shallow (it never goes more than two levels deep).
Most of the files never change; each day there are a few thousand updated files and 1,000-2,000 new ones. The data is historical reporting data that we retain after the source data has been purged (i.e. these are finalized reports whose source data is old enough that we archive and delete it). Synchronizing once per day is sufficient, provided it completes in a reasonable time frame. Reports are generated overnight, and we sync first thing in the morning as a scheduled task.
Since so few of the files change on any given day, incremental copying should benefit us greatly. We have tried rsync, but it can take eight to twelve hours just to complete the "building file list" stage. We are clearly outgrowing what rsync can handle (a 12-hour window is far too long).
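For reference, one workaround we have sketched (but not yet validated) is to skip rsync's full tree walk entirely: scan the shallow tree ourselves for recently modified files and feed only those paths to rsync via --files-from. The paths, host name, and 26-hour window below are placeholders for illustration, and deletions would still need separate handling:

    #!/usr/bin/env python3
    """Build a list of files modified in the last ~day and hand it to rsync
    via --files-from, so rsync never walks the full 800k-file tree."""

    import os
    import subprocess
    import sys
    import time

    SOURCE_ROOT = "/data/reports"      # hypothetical source root
    DEST = "dmz-host:/data/reports"    # hypothetical destination
    WINDOW_SECONDS = 26 * 60 * 60      # a bit over a day, to allow overlap

    def changed_files(root, cutoff):
        """Yield paths (relative to root) whose mtime is newer than cutoff.
        The tree is only two levels deep, so a metadata-only scan is cheap
        compared to rsync's file-list build."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                try:
                    if os.stat(full).st_mtime >= cutoff:
                        yield os.path.relpath(full, root)
                except FileNotFoundError:
                    pass  # file vanished between listing and stat; skip it

    def main():
        cutoff = time.time() - WINDOW_SECONDS
        list_path = "/tmp/changed-files.txt"
        with open(list_path, "w") as fh:
            for rel in changed_files(SOURCE_ROOT, cutoff):
                fh.write(rel + "\n")

        # --files-from makes rsync transfer only the listed paths; note that
        # files deleted on the source are not propagated by this run.
        result = subprocess.run(
            ["rsync", "-a", "--files-from=" + list_path, SOURCE_ROOT + "/", DEST],
            check=False,
        )
        sys.exit(result.returncode)

    if __name__ == "__main__":
        main()

We would still prefer an off-the-shelf tool over maintaining a script like this ourselves.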
We had been using another tool called RepliWeb to synchronize the structures, and it can do an incremental transfer in around 45 minutes. However, it seems we've exceeded its limits: it has started reporting files as deleted when they are not (perhaps some internal memory structure has been exhausted; we're not sure).
Has anyone else tackled a large-scale synchronization project of this sort? Is there a tool designed to synchronize massive file structures like this?