We are using two servers separated by a WAN to replicate approximately 1TB of data.
On the master side we have a single server with a Gluster volume exported to a number of other servers that write data into it.
On the slave side we have a single server with a Gluster volume exported as a read-only share to disaster recovery servers.
Over time the slave has fallen out of sync with the master to the tune of 200GB: files that should be there aren't, and files that have been deleted on the master still are. There is no obvious pattern to the inconsistency.
What is the simplest way to force Gluster to checksum every file on the slave and re-replicate where required?
The documentation suggests:
Description: GlusterFS Geo-replication did not synchronize the data completely, but the geo-replication status still displays OK.
Solution: You can enforce a full sync of the data by erasing the index and restarting GlusterFS Geo-replication. After restarting, GlusterFS Geo-replication begins synchronizing all the data; that is, all files are compared by checksum, which can be a lengthy, high-resource-utilization operation, mainly on large data sets (however, actual data loss will not occur). If the error situation persists, contact Gluster Support.
But it does not say where this index actually lives.
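For what it's worth, my reading of the geo-replication design (which may not match every release) is that the "index" is kept as xtime marker extended attributes on the files on each brick, rather than in a separate database. Assuming that's right, something like this, run against a file on the brick itself (our brick path; the attribute name embeds the volume UUID), should show whether the markers are present:

# getfattr -d -m 'trusted.glusterfs' -e hex /export/brick1/some/file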
# gluster volume geo-replication share gluk1::share stop
Stopping geo-replication session between share & gluk1::share has been successful
# gluster volume set share geo-replication.indexing off
volume set: failed: geo-replication.indexing cannot be disabled while geo-replication sessions exist
Disabling indexing fails while any geo-replication session still exists, and the documentation doesn't mention this requirement.
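My best guess, untested, is that the session has to be removed entirely (not just stopped) before indexing can be disabled, along these lines; the delete subcommand exists in more recent releases, and the session would presumably need to be re-created afterwards:

# gluster volume geo-replication share gluk1::share stop
# gluster volume geo-replication share gluk1::share delete
# gluster volume set share geo-replication.indexing off
# gluster volume geo-replication share gluk1::share start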
Any suggestions?
Your slave became out of sync because GlusterFS Geo-Replication is not meant to maintain multiple writable copies of a data pool (a distributed FS); rather, it is meant for disaster recovery (a read-only backup).
In short, geo-replication is a master/slave model where only the master site accepts writes/changes, and those changes are periodically synced to the remote read-only slave.
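You can see what the master thinks of that push from the session status on the master side; for example, with your volume names:

# gluster volume geo-replication share gluk1::share status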
To have a true distributed, replicated filesystem you would have to use GlusterFS's "Replicated Volume" feature. The drawback is that with the current replication scheme writes are forced to be synchronous: if you replicate across a WAN link, even your local, intra-LAN writes will be as slow as the WAN path. To overcome this limit, a "New Style Replication" has been considered for inclusion, but it does not appear to be implemented yet (at least not in the stable, enterprise distributions).
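As a rough sketch of what that would look like (hostnames and brick paths here are invented), a two-brick replica spanning the two sites is created and started like this, after which every write is acknowledged by both bricks before returning, which is exactly why it crawls over a WAN:

# gluster volume create share-repl replica 2 transport tcp master1:/export/brick1 slave1:/export/brick1
# gluster volume start share-repl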
Back to your current situation: you are in a classic "split-brain" scenario and I am not sure how much can be automated. Your master and slave have different views of the underlying volume, and they have probably accumulated different, incompatible changes to the same files. I think you will have to review them more or less manually...
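If it helps with that review, a checksum-based dry run of rsync between a mount of the master volume and a mount of the slave volume (mount points and host below are hypothetical) will itemize every file that differs or should be deleted, without changing anything:

# rsync -rcni --delete /mnt/master-share/ slave1:/mnt/slave-share/

Dropping the -n would actually push the master's view onto the slave, but only do that with geo-replication stopped.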