We have always had problems with DFS, but recently it has gotten worse with no apparent cause, and it's becoming harmful. We have one master server and DFS connections to four other servers. The four servers don't modify any files, so all replication always propagates from the master to the other four. The replicated directory contains about 900,000 files. In recent weeks, every time we check, the DFS backlogs hold hundreds of thousands of files. For instance, at the moment the master server is replicating about 700,000 files to three of the four servers, while the fourth one is fine. Sometimes only one is off, sometimes two, and this time three. Also, it is never the same set of servers. It is inconceivable that something periodically touches all 900,000 files. The biggest change that happens is a scheduled update of several thousand files every six hours.
Does anybody have the same problem? Is it a known issue?
Update: (This is also an answer to some of the questions raised by Jeff Miles.) The problem happened again a few hours ago. I set up some probes in the morning and monitored the servers during the day, and at a seemingly random time, three backlogs ballooned to 3 million changes (which is more than the total number of files) within a minute. Nothing interesting in the DFS event log. Not even a "started initial replication" event. Only a couple of "DFS connection lost or unresponsive" errors, but those happened about 10 minutes after the fact, most likely because something choked on the huge backlogs. More importantly, the fourth server is fine. This indicates that the 3 million changes are most likely bogus. Also, I can't imagine anything changing that many files in such a short interval. Regarding the technical setup: it is a combination of Win2003R2 and Win2008R2. Could that be a problem?
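For reference, the probes are along these lines (a rough sketch of what I run against the master; the replication group, folder, and server names are placeholders for our setup, it assumes dfsrdiag.exe is available, and the output parsing assumes the English "Backlog File Count" line):

```python
import re
import subprocess
import time

# Placeholders for our actual setup.
MASTER = "MASTER01"
MEMBERS = ["BRANCH01", "BRANCH02", "BRANCH03", "BRANCH04"]
RGROUP = "MainData"      # replication group name
RFOLDER = "SharedFiles"  # replicated folder name

def backlog_count(sending, receiving):
    """Return the backlog count reported by dfsrdiag, or None if not found."""
    cmd = [
        "dfsrdiag", "backlog",
        f"/rgname:{RGROUP}", f"/rfname:{RFOLDER}",
        f"/smem:{sending}", f"/rmem:{receiving}",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    match = re.search(r"Backlog File Count:\s*(\d+)", out)
    return int(match.group(1)) if match else None

while True:
    stamp = time.strftime("%H:%M:%S")
    counts = {member: backlog_count(MASTER, member) for member in MEMBERS}
    print(stamp, counts, flush=True)
    time.sleep(300)  # poll every 5 minutes
```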
First, verify your topology. Carefully review the replication connections under the "Connections" tab in your replication set properties.
I have seen full mesh topologies accidentally added that result in problems like you are seeing.
Other possible culprits:

- Antivirus scanning or file indexing on one or more of the servers or one of their clients. (Opening a file updates its access time, which must then be replicated to all peers.) A quick check for this kind of scanner activity is sketched after this list.
- One or more very large files jamming up replication. This should show up in your DFS-R logs.
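For the first culprit, something like this can flag files that were recently touched (a rough sketch; the path and time window are placeholders, and it only looks at last-write and last-access timestamps, not ACL or attribute changes):

```python
import os
import time

ROOT = r"D:\ReplicatedFolder"   # placeholder path to the replicated tree
WINDOW = 60 * 60                # report anything touched in the last hour

def walk(root):
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from walk(entry.path)
        elif entry.is_file(follow_symlinks=False):
            yield entry

now = time.time()
touched = 0
for entry in walk(ROOT):
    st = entry.stat()
    if now - max(st.st_mtime, st.st_atime) < WINDOW:
        touched += 1
        if touched <= 20:   # print only the first few examples
            print(entry.path, time.ctime(st.st_mtime))
print(f"{touched} files touched in the last {WINDOW // 60} minutes")
```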
Finally, do you need DFS-R, or could a regular robocopy be used to keep the folders in sync?
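If a one-way sync is really all you need, a scheduled robocopy mirror from the master would do it. A sketch of what that could look like (share names and log path are placeholders; note that /MIR also deletes files on the targets that no longer exist on the source, and robocopy's non-zero exit codes are normal on success):

```python
import subprocess

SOURCE = r"D:\ReplicatedFolder"                     # placeholder source path
TARGETS = [r"\\BRANCH01\Data", r"\\BRANCH02\Data"]  # placeholder UNC targets

for i, target in enumerate(TARGETS, 1):
    # /MIR mirrors the tree (including deletions), /R and /W keep retries short,
    # /NP suppresses per-file progress, /LOG+ appends to a log file.
    subprocess.run([
        "robocopy", SOURCE, target,
        "/MIR", "/R:1", "/W:5", "/NP", rf"/LOG+:C:\Logs\sync{i}.log",
    ])
```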
If you're seeing hundreds of thousands of files in the backlog on a regular basis, I would guess that something is changing the security ACLs on your files, especially if you aren't seeing much network traffic while the backlog clears.
One way to check what is modifying these files is to turn on auditing. Ned Pyle of the Microsoft Directory Services team recently put out a blog post on Global Object Access Auditing that might help you determine what is changing: http://blogs.technet.com/b/askds/archive/2011/03/10/global-object-access-auditing-is-magic.aspx
I would check your DFSR event log too, and look for any event ID 4102 (started initial replication) or 4104 (initial replication finished). If your files aren't being modified, the only reason I can think of for hundreds of thousands of files in the backlog is initial replication. If your DFSR service is crashing, it could corrupt the DFSR database and trigger initial replication.
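A quick way to pull just those events is something like the following (a sketch using wevtutil against the "DFS Replication" log; adjust the /c: count to pull more history):

```python
import subprocess

# Query the DFS Replication event log for initial-replication events
# (4102 = started, 4104 = finished), newest first.
query = "*[System[(EventID=4102 or EventID=4104)]]"
out = subprocess.run(
    ["wevtutil", "qe", "DFS Replication",
     f"/q:{query}", "/f:text", "/c:50", "/rd:true"],
    capture_output=True, text=True,
).stdout
print(out or "No initial replication events found.")
```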
If you can, I'd try to use Read Only DFSR, described here: http://blogs.technet.com/b/askds/archive/2010/03/08/read-only-replication-in-r2.aspx
I imagine, based on your Server 2003 tag, that you can't do it yet, but it's worth a mention based on your use case.
Since you are seeing unreasonable numbers of files being replicated within a very short period, there must be an application that is changing file attributes or USN Journal values without changing file data. For example, backup software changing the Archive bit would trigger this, as would some AV software.
Testing Anti-Virus Application Interoperability with DFS Replication
I would set up a test replication group to troubleshoot against and test the effects of backup software, AV software, etc. on replication. In addition to the other recommendations you have received, I would also log and watch for changes in the USN Journal without the file data changing. The link provided is a good article on checking for applications that change the USN Journal without changing file data and therefore cause excessive replication.
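One crude way to catch attribute-only churn is to snapshot file attributes before and after the backup or AV run and compare (a sketch; the paths are placeholders, and it only flags files whose attribute bits changed while size and last-write time stayed the same):

```python
import json
import os
import stat

ROOT = r"D:\ReplicatedFolder"          # placeholder path to the replicated tree
SNAPSHOT = r"C:\Temp\attr_snapshot.json"

def snapshot(root):
    data = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            data[path] = {
                "size": st.st_size,
                "mtime": st.st_mtime,
                # Windows-only: raw file attribute bits (Archive, Hidden, ...)
                "attrs": getattr(st, "st_file_attributes", 0),
            }
    return data

current = snapshot(ROOT)
if os.path.exists(SNAPSHOT):
    with open(SNAPSHOT) as f:
        previous = json.load(f)
    for path, cur in current.items():
        old = previous.get(path)
        if old and old["attrs"] != cur["attrs"] \
                and old["size"] == cur["size"] and old["mtime"] == cur["mtime"]:
            flipped = (old["attrs"] ^ cur["attrs"]) & stat.FILE_ATTRIBUTE_ARCHIVE
            print(path, "attribute change only", "(Archive bit)" if flipped else "")
with open(SNAPSHOT, "w") as f:
    json.dump(current, f)
```

Run it once to take a baseline, let the suspect software do its pass, then run it again and see what it reports.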
Watch out for File Screens, Quotas, etc as well. I have seen some scenarios where a file screen stopped replication altogether.
Is your antivirus software set to scan the DFSR private folders (Staging, Conflict and Deleted, etc.)?
-Ken