We're a small shop running a Dell PowerEdge T420 (dual-socket, only one CPU present, 6 cores) with 32 GB RAM as our main server. We have only 5 VMs, one of which is our WSE 2012 DC.
From time to time, at a rate for which we've been unable to establish a reliable pattern, all of our VMs concurrently spike to 100% CPU while the host remains quiet at 4-5%. A warm boot of the host doesn't provide relief, but a cold boot at least puts things back in the box until the problem recurs.
Sometimes we get a week or more of calm seas out of it; sometimes only a day. One unreliable pattern is that it tends to kick off during an extended idle period, e.g. overnight. An examination of the server's temperature logs first led us to suspect overheating, but further investigation into recent incidents has spoiled that lead.
We also found descriptions of similar problems on the Dell forums, with claims of resolution by installing the latest round of Dell updates. We recently engaged in a project to do just that (as an aside, it was quite an adventure getting ~700GB of VHDs safely off of and then back onto that machine), but to our utter dismay it didn't help.
We're absolutely befuddled. So is Microsoft support (or at least first-tier support is, even though they try not to act like it). Our SystemInfo output is included below.
Does anyone know where to start looking?
Thanks
===================================
Host Name:                 SERVER1
OS Name:                   Microsoft Hyper-V Server 2012 R2
OS Version:                6.3.9600 N/A Build 9600
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Server
OS Build Type:             Multiprocessor Free
Registered Owner:          Windows User
Registered Organization:
Product ID:                06401-029-0000043-76293
Original Install Date:     4/3/2014, 4:07:15 PM
System Boot Time:          5/4/2014, 1:56:47 PM
System Manufacturer:       Dell Inc.
System Model:              PowerEdge T420
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 45 Stepping 7 GenuineIntel ~2200 Mhz [Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20 GHz] (manually added)
BIOS Version:              Dell Inc. 2.1.2, 1/20/2014
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC-09:00) Alaska
Total Physical Memory:     32,723 MB
Available Physical Memory: 12,716 MB
Virtual Memory: Max Size:  37,587 MB
Virtual Memory: Available: 17,129 MB
Virtual Memory: In Use:    20,458 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    OIT
Logon Server:              \\SERVER1
Hotfix(s):                 31 Hotfix(s) Installed.
                           [01]: KB2843630
                           [02]: KB2862152
                           [03]: KB2868626
                           [04]: KB2876331
                           [05]: KB2883200
                           [06]: KB2884846
                           [07]: KB2887595
                           [08]: KB2892074
                           [09]: KB2893294
                           [10]: KB2894179
                           [11]: KB2898514
                           [12]: KB2898871
                           [13]: KB2901101
                           [14]: KB2901128
                           [15]: KB2903939
                           [16]: KB2904266
                           [17]: KB2908174
                           [18]: KB2909210
                           [19]: KB2911106
                           [20]: KB2913760
                           [21]: KB2916036
                           [22]: KB2917929
                           [23]: KB2919394
                           [24]: KB2919442
                           [25]: KB2922229
                           [26]: KB2923300
                           [27]: KB2923768
                           [28]: KB2928193
                           [29]: KB2928680
                           [30]: KB2930275
                           [31]: KB2939087
Network Card(s):           3 NIC(s) Installed.
                           [01]: Broadcom NetXtreme Gigabit Ethernet
                                 Connection Name: NIC1
                                 DHCP Enabled:    No
                                 IP address(es)
                           [02]: Broadcom NetXtreme Gigabit Ethernet
                                 Connection Name: NIC2
                                 DHCP Enabled:    Yes
                                 DHCP Server:     192.168.1.12
                                 IP address(es)
                                 [01]: 192.168.1.135
                                 [02]: fe80::915b:8de0:712e:29f1
                           [03]: Hyper-V Virtual Ethernet Adapter
                                 Connection Name: vEthernet (External NIC 1_Internal)
                                 DHCP Enabled:    No
                                 IP address(es)
                                 [01]: 192.168.1.11
                                 [02]: fe80::2d35:f582:4958:9eb2
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.
== EDIT ======================
I've found the solution to this issue; I waited for over a year to make sure we didn't encounter any more instances of the problem.
Moderators: I'd like to request a reopening of the question, so that I can post the answer.
After waiting for over a year to confirm the solution holds, I'm finally able to post this answer.
Dell's default BIOS settings have C-States enabled, which drop the CPU into low-power states during idle periods. This is what causes the VMs to spiral into 100% CPU usage on a hypervisor host (VMware and Citrix hosts included).
The solution is to set the System Profile setting in the BIOS to Performance, as opposed to Performance per watt [OS] or Performance per watt [DAPC] (the latter being the default).
The relevant Dell documentation, p. 3:
http://en.community.dell.com/techcenter/extras/m/white_papers/20161975/download
And this reply from one of the few Dell support engineers who's familiar with the issue:
In a nutshell, power idling on a Dell server should always be turned off (set to Performance) for Hypervisor hosts.
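As a complementary OS-side step (a sketch only; the BIOS System Profile change above is the actual fix), you can also pin the management OS to the built-in High performance power plan so Windows itself never requests deep idle states. The GUID below is the stock Windows High performance scheme; the helper function name is my own:

```python
import subprocess

# Well-known GUID of Windows' built-in "High performance" power scheme.
HIGH_PERFORMANCE_GUID = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"

def pin_high_performance():
    """Activate the High performance power plan on a Windows host.

    Runs `powercfg /setactive <guid>`; raises CalledProcessError on failure.
    Windows-only; run from an elevated prompt on the Hyper-V host.
    """
    subprocess.run(
        ["powercfg", "/setactive", HIGH_PERFORMANCE_GUID],
        check=True,
    )
```

Note this only governs OS-requested power management; if the BIOS profile is still Performance per watt (DAPC), the hardware can idle down regardless of what the OS asks for.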
Thanks to Eddy Simons at Kitsap Bank for helping me to find this solution.
It's unclear what the problem is; you already know that, and we can't tell you the cause from here. However, you can run some tests:
Build VM 1: perform millions of complex mathematical calculations per second.
Build VM 2: create a giant array in memory, delete it, repeat.
Build VM 3: read/write/delete millions of lines to/from a file.
Build VM 4: copy files to/from an SMB share.
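The four workloads above can be sketched as small Python routines, one per test VM (a minimal sketch; scale the iteration counts up until the VM is saturated, and note the SMB path in the last helper is a hypothetical placeholder):

```python
import math
import os
import shutil
import tempfile

def cpu_stress(iterations=1_000_000):
    """VM 1: hammer the CPU with floating-point math."""
    acc = 0.0
    for i in range(1, iterations):
        acc += math.sqrt(i) * math.sin(i)
    return acc

def memory_stress(rounds=3, size=10_000_000):
    """VM 2: allocate a giant list, delete it, repeat."""
    for _ in range(rounds):
        big = [0] * size  # allocate
        del big           # free
    return rounds

def disk_stress(lines=1_000_000):
    """VM 3: write many lines to a temp file, read them back, delete it."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "w") as f:
        for i in range(lines):
            f.write(f"line {i}\n")
    with open(path) as f:
        count = sum(1 for _ in f)
    os.remove(path)
    return count

def smb_stress(src, dest_share):
    """VM 4: copy a file to an SMB share (dest_share is a hypothetical
    UNC path such as \\\\server\\share) and back again."""
    copied = shutil.copy(src, dest_share)
    shutil.copy(copied, src + ".back")
    return copied
```

Run each routine in a loop on its dedicated VM, then compare per-VM performance counters when the spike recurs.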
Wait until the problem occurs again, observe performance data on each of these servers.
Which was most affected?
Were any not affected at all?
My guess is that your disks suck and the CPU is waiting for I/O operations to complete before continuing, which can cause some applications to peg the CPU.
Glad I found this. I have a 2012 R2 server running Hyper-V on an AMD six-core CPU. It had been running perfectly for over a year. Suddenly I started seeing VMs that could not be connected to, not with RDP nor with Hyper-V Connect. The only option was to TURN OFF the VM; Shut Down got no response. So... pull the virtual plug out of the wall and turn it back on.
The symptom was that the individual machine seemed to be using 100% of its allocated CPU (e.g. a one-core VM on a six-core host was pegged at around 16% of host CPU).
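That 16% figure checks out: one fully pegged vCPU on a six-core host consumes roughly 1/6 of total host capacity. A quick sanity check (helper name is my own):

```python
def host_cpu_share(vm_vcpus, host_cores, vcpu_utilization=1.0):
    """Fraction of total host CPU a VM can consume when its vCPUs run
    at the given utilization (1.0 = fully pegged)."""
    return (vm_vcpus * vcpu_utilization) / host_cores

# One fully pegged single-vCPU VM on a six-core host:
share = host_cpu_share(1, 6)  # 1/6, i.e. about 16-17% of host CPU
```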
The problem was sporadic. No apparent rhyme or reason.
It finally occurred to me that this was coincident with my failed attempt to upgrade that motherboard from 32 to 64 GB. THAT problem was that I could get 1, 2, or 3 sticks of 16 GB memory to work (16, 32, or 48 GB), but not four sticks for 64 GB. Lots of horsing around with BIOS settings, etc. No joy on that front. That's when I discovered the wonderful Enable Dynamic Memory feature on the VM. Turns out I could survive without the 64 gig after all!
I'm guessing that I turned on power management for the CPU in my tinkering, and then this issue appeared.
I have turned off APM in the BIOS. It'll take a couple of days before I'm 60% confident this fixed it, and a couple of weeks to declare victory. But this FEELS like a good explanation for the problem.
It's been 24 hours now and so far so good.
Fingers crossed.
Thanks for the information!!