We have a number of Server 2012 systems, all of which run virtualised on Hyper-V 2012. Two of these virtual instances, both used as file servers, occasionally stop responding to requests to serve files to clients. After logging on to an affected server, attempts to shut it down gracefully fail (there is no error; it just doesn't acknowledge the shutdown request).
Recovery is a case of power cycling the server(s) from the Hyper-V console.
These two servers don't serve many users (one serves no more than 6, the other around 20). They are in the same domain but on different physical hardware at different sites, and they don't lock up at the same time. They use DFSR to replicate a fairly large amount of data (200GB) between themselves over ADSL connections. This is working fine, and we used DFSR the same way on the previous two generations of server OS (Server 2008 R2 and Server 2003), although both of those were physical installs.
Today, when one of the servers crashed, I noticed an entry in the event log, which looked similar to the following:
Log Name: Application
Source: ESENT
Date: 27/11/2012 10:25:55
Event ID: 533
Task Category: General
Level: Warning
Keywords: Classic
User: N/A
Computer: HAL-FS-01.example.com
Description:
DFSRs (1500) \\.\E:\System Volume Information\DFSR\database_C8CC_101_CC00_EC0E\
dfsr.db: A request to write to the file "\\.\E:\System Volume Information\
DFSR\database_C8CC_101_CC00_EC0E\fsr.log" at offset 4423680 (0x0000000000438000)
for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is
likely due to faulty hardware. Please contact your hardware vendor for further
assistance diagnosing the problem.
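For anyone comparing several of these warnings, the useful figures (file, offset, write size and how long the write has been stalled) can be pulled out of the description text. Below is a minimal Python sketch. It assumes the description has been exported as a single unwrapped string and is worded exactly like the entry above; the function name `parse_stalled_write` and the `SAMPLE` constant are made up for illustration and are not part of any Windows tooling.

```python
import re

# Built only from the wording of the one event shown above (an assumption);
# other ESENT messages may be phrased differently and will simply not match.
STALLED_WRITE = re.compile(
    r'write\s+to\s+the\s+file\s+"(?P<path>[^"]+)"\s+'
    r'at\s+offset\s+(?P<offset>\d+)\s+\(0x[0-9A-Fa-f]+\)\s+'
    r'for\s+(?P<size>\d+)\s+\(0x[0-9A-Fa-f]+\)\s+bytes\s+'
    r'has\s+not\s+completed\s+for\s+(?P<seconds>\d+)\s+second'
)


def parse_stalled_write(description):
    """Return path, offset, size and stall time from an event description,
    or None if the text does not look like the warning above."""
    match = STALLED_WRITE.search(description)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "offset": int(match.group("offset")),
        "size": int(match.group("size")),
        "seconds": int(match.group("seconds")),
    }


# The relevant part of the description above, joined onto one line.
SAMPLE = (
    r'A request to write to the file '
    r'"\\.\E:\System Volume Information\DFSR\database_C8CC_101_CC00_EC0E\fsr.log" '
    r'at offset 4423680 (0x0000000000438000) for 4096 (0x00001000) bytes '
    r'has not completed for 36 second(s).'
)

info = parse_stalled_write(SAMPLE)
print(info)
```

Running this over every matching entry would show whether the stalled writes always hit the same file and volume, and how long they hang before the server stops responding.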
When the server started up again, I went looking for today's entry to investigate further and found that it was no longer there (I assume it was in memory but was never written to disk before the server was powered off, for the reason mentioned in the message). The entry shown above is an earlier occurrence, which I found by searching further back in the event log.
Both of these virtual servers have their E: volumes fully allocated as opposed to dynamically expanding, and there are no issues on any of the other virtual servers (which include Server 2012, Server 2008 R2 and Ubuntu 12.04 x64). There are no signs of I/O, memory or CPU starvation on the host systems.
I've used performance counters on the affected virtual servers to monitor memory usage (including non paged pool usage), as well as CPU and network utilisation, and none of these show any signs of trouble when the issue arises.
I would have thought our configuration isn't that uncommon, so I'm wondering if anyone else has seen this, and managed to resolve the problem?
The host specifications are as follows:
hal-vm-01: Dell PowerEdge R710, 16GB RAM, 6 x 300GB 15K SAS in RAID 10, PERC H700, running a total of 5 virtual servers (affected file server, DC + other guests).
hey-vm-01: Dell PowerEdge T620, 16GB RAM, 2 x 3TB SATA in RAID 1, PERC H310, running 2 virtual servers (affected file server and DC).
We have a further virtual server host, hal-vm-02, running 5 guests (Exchange, DC, SQL + other guests). It is unaffected by this problem, even though it is a lower spec than hal-vm-01 and loaded about the same. More memory is on the way so that we can configure shared nothing failover between this host and hal-vm-01.
There is AV software (MS SCEP) running on the two affected virtual servers. It is configured to scan on create only, and not to scan files created by the dfsrs.exe process. There is no AV software running on the VM hosts themselves.
We are using Windows Server 2012 backup on the host hal-vm-01 to back up all the VMs. The other host, hey-vm-01, isn't backed up, as it just holds an off-site DFSR copy of the data at our main office. Another backup job runs on the affected virtual guest hal-fs-01; this also uses Windows Server backup, and takes snapshots of the data stored in the DFS replicated shares. Both backup jobs run out of office hours.
Three months later...
We've had a support ticket open with Microsoft for over three months now, and have sent them lots of performance counter logs, memory dumps and event logs. Their analysis pointed to a problem with one of the virtual drives of hal-fs-01 (the virtual server with the problem). The drive in question was the server's E:\ drive, which happened to hold all our DFSR groups and shares. Recently, I moved all the data off the E:\ drive to several smaller virtual disks that I added to the server, moving all the shares and DFSR groups with it and leaving just the Windows Deployment Services files on E:\. Despite this, we still saw writes to the E:\ drive failing.
Last week I moved the WDS files to a new virtual disk and disabled the WDS service. I also deleted the E:\ virtual disk, just in case there was some anomaly with the disk itself. We've not had another failure since, but it's too early to know whether this has fixed the problem: our longest uptime was previously around 2 weeks, and as of the time of this edit (20/03/2013) we are only one week into the current config. If the problem hasn't surfaced again by next week, I'll re-enable WDS, as I suspect WDS could be the culprit.
I'll keep this question updated (or provide an answer if I manage to resolve the problem).
Moved back to Server 2008 R2...
I haven't updated the question with progress, but we ended up rolling back to Server 2008 R2, and everything works fine. I'd still be interested in hearing from anyone who has had this issue and managed to find a fix.