We have a DFS infrastructure with 3 servers and 93 replicated folders. When I run a health report from the DFS Management console, the status of one of these folders is listed as "uninitialized". This folder has previously replicated normally.
Rebooting all 3 DFS servers resolves the "uninitialized" state and the folder appears to begin replicating normally. However, it will fall back into an "uninitialized" state rather quickly, usually within a week.
I've been monitoring this folder in DFS, and it does appear that vast numbers of changes hit it in very short periods of time: the replication backlog jumps to over 100,000 entries in the early morning on weekdays. Ordinarily, the backlog goes down quickly over the next couple of hours, so I haven't worried about it.
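To put numbers on these spikes, the backlog could be sampled on a schedule. Below is a minimal sketch of how that might look, assuming the `dfsrdiag backlog` output contains a line of the form `Member <B> Backlog File Count: N` (or a "No Backlog" message when the members are in sync). The sample text and the group, folder, and server names are placeholders, not our real ones, and the output format should be checked against a real run before relying on the parser.

```python
import re

# Assumed shape of the count line in `dfsrdiag backlog` output.
COUNT_RE = re.compile(r"Backlog File Count:\s*(\d+)")


def build_backlog_command(rg_name, rf_name, sending, receiving):
    """Return the dfsrdiag command line as an argument list."""
    return [
        "dfsrdiag", "backlog",
        f"/rgname:{rg_name}",
        f"/rfname:{rf_name}",
        f"/smem:{sending}",
        f"/rmem:{receiving}",
    ]


def parse_backlog_count(output):
    """Return the backlog count, 0 if the output reports no backlog,
    or None if the text is not recognized."""
    match = COUNT_RE.search(output)
    if match:
        return int(match.group(1))
    if "No Backlog" in output:
        return 0
    return None


# Placeholder names and hypothetical sample output, for illustration only.
print(build_backlog_command("ExampleGroup", "ExampleFolder", "SERVERA", "SERVERB"))
print(parse_backlog_count("Member <SERVERB> Backlog File Count: 100000"))
```

The argument list could be passed to `subprocess.run(..., capture_output=True, text=True)` on one of the servers, with the parsed count appended to a log file along with a timestamp.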
However, this "uninitialized" status now means no replication is taking place at all on servers where the folder has this status, so now we have a problem. I haven't tracked down specific files or causes, but I've sent inquiries to the desktop team to help identify what's causing the backlog.
I have found no event log errors related to this folder or status. I thought the high number of file changes on the volume might be causing journal wrap errors, but I haven't found any event log entries related to USN journal wrap. The folder does have consistent sharing violations, but before this "uninitialized" problem appeared, these always resolved themselves once the files were closed.
My research has turned up nothing except possible corruption of the DFSR configuration XML, and in those cases the problem affected only SYSVOL replication.
My only hypothesis is that DFSR is automatically setting the status to "uninitialized" when the number of differences passes beyond a certain threshold. But I'm not able to test this hypothesis and I can't find any documentation to back it up. And even if it's true, I don't know how I would go about "reinitializing" the folder.
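For reference, the status shown in the health report appears to correspond to the `State` property of the `DfsrReplicatedFolderInfo` WMI class in the `root\MicrosoftDFS` namespace. The sketch below maps the numeric values to names as I understand them from Microsoft's documentation; the helper name is my own, and the mapping is worth verifying before building any monitoring on it.

```python
# State values of DfsrReplicatedFolderInfo as I understand them;
# treat this mapping as an assumption to verify.
DFSR_FOLDER_STATES = {
    0: "Uninitialized",
    1: "Initialized",
    2: "Initial Sync",
    3: "Auto Recovery",
    4: "Normal",
    5: "In Error",
}


def describe_state(code):
    """Return a readable name for a numeric replicated-folder state."""
    return DFSR_FOLDER_STATES.get(code, f"Unknown ({code})")


# Server B currently reports the first of these for the affected folder.
print(describe_state(0))
print(describe_state(4))
```

Polling this value per server would at least show when the folder drops out of "Normal", which could then be lined up against the morning backlog spikes.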
The servers involved are:
A: Sending server, Windows Server 2008 R2, staging quota 25 GB, status: Normal
B: Receiving server, Windows Server 2008 R2, staging quota 175 GB, status: Uninitialized
C: Receiving server, Windows Server 2012 R2, staging quota 25 GB, status: Normal
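Since the staging quotas differ so much between servers, one thing worth checking is whether each quota is large enough for this folder. The commonly cited Microsoft guideline for Windows Server 2008 and later is that the staging quota should be at least the combined size of the 32 largest files in the replicated folder. Below is a minimal sketch of that check; the function names and the example path are hypothetical, and I have not yet run this against the affected folder.

```python
import heapq
import os


def largest_files_total(root, count=32):
    """Sum the sizes in bytes of the `count` largest files under `root`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                # Skip files that vanish or cannot be read during the scan.
                continue
    return sum(heapq.nlargest(count, sizes))


def quota_is_sufficient(quota_gb, required_bytes):
    """Compare a staging quota given in GB against a required byte count."""
    return quota_gb * 1024**3 >= required_bytes


# Example usage (hypothetical path):
# required = largest_files_total(r"D:\Shares\ExampleFolder")
# print(quota_is_sufficient(25, required))
```

With 547,252 files the walk will take a while, so it is probably best run outside business hours.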
All three servers are pulling double duty as AD domain controllers. All 93 replicated folders are in the same replication group, so deleting and recreating the RG would be time-prohibitive. When this problem first occurred, a small handful of other folders also showed this status, but only this one folder has had the problem recur after rebooting. The affected folder is 202 GB in size with 547,252 files.
What is causing the folder to become "uninitialized" and how do I resolve this?
-Edit- Some more information. The receiving server rebooted at midnight yesterday (~36 hours ago). This brought the folder into "Normal" status, and a backlog began building. When I checked yesterday, the backlog on this folder was 205,662 files. When I checked today, it was 579,447 files. The folder presently contains only 551,706 files, so the backlog is larger than the number of files in the folder. The DFS health report says 851,592 files have been received in this folder. So far, no other folders are having a problem like this.
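Putting those figures side by side is plain arithmetic on the numbers quoted above; the variable names are only for illustration. A backlog larger than the file count would be consistent with some files being queued more than once, though that is a guess on my part.

```python
backlog_yesterday = 205_662
backlog_today = 579_447
files_in_folder = 551_706
files_received = 851_592

growth = backlog_today - backlog_yesterday
backlog_ratio = backlog_today / files_in_folder
received_ratio = files_received / files_in_folder

print(f"Backlog growth between checks: {growth:,} entries")           # 373,785
print(f"Backlog vs. files in folder: {backlog_ratio:.2f}x")           # 1.05x
print(f"Files received vs. files in folder: {received_ratio:.2f}x")   # 1.54x
```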
I don't know if the backlog is causing replication to fail, or if replication is failing and causing the backlog, or if there's some underlying database or journal log corruption causing both failed replication and a backlog. Nor do I know how to resolve the problem in either case.
Right now there's a single replication group for 93 folders. I'm about ready to blow it away and configure 93 replication groups. If that doesn't solve the problem, at least it will make it easier to troubleshoot.