I'm analyzing a problem where the performance of CPU-bound workloads inside virtual machines is often (though not always) far below what we would expect from the underlying hardware.
We're using Hyper-V on Windows Server 2012 R2. The server has dual Intel Xeon E5-2643 v2 @ 3.50 GHz.
Here are some figures that seem to be relevant (a PowerShell snippet to sample them follows the list):
- Hyper-V Hypervisor Logical Processor, % Total Run Time, Instance _Total: Average 20%
- Hyper-V Hypervisor Virtual Processor, CPU Wait Time Per Dispatch, Instance _Total: Average 20000 (this number seems to be well on the safe side, so it doesn't look like the hypervisor has to "steal" time from one VM's virtual CPUs to schedule another VM onto the logical CPUs; it seems to translate into an overhead of about 2%)
- Hyper-V Hypervisor Logical Processor, % of Max Frequency, Instance _Total: Average 34%
- CPU-Z tool shows most of the time around 1200 MHz for Core #0 of both processors (pretty much matches the % of Max Frequency reported by Performance Monitor)
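For reference, this is roughly how I sample those counters from PowerShell on the host; just a sketch, and it assumes the default English counter names:

    # Sample the three counters above every 5 seconds, 12 times (run on the Hyper-V host).
    Get-Counter -SampleInterval 5 -MaxSamples 12 -Counter @(
        '\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time',
        '\Hyper-V Hypervisor Logical Processor(_Total)\% of Max Frequency',
        '\Hyper-V Hypervisor Virtual Processor(_Total)\CPU Wait Time Per Dispatch'
    ) | ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }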
On a desktop with only a few cores, core speed goes up immediately as soon as a CPU-bound activity starts.
On our Hyper-V hosts, however, core speed only seems to go up once the overall system load has been high for a few seconds. For example, if a VM with 4 virtual CPUs (out of 24 logical processors in total, with Hyper-Threading turned on) needs CPU power and Task Manager inside the VM shows nearly 100% CPU usage, most of the time the clock speed of the physical CPU won't go up and performance is bad.
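To watch this from the host side while the guest is busy, I poll the reported clock speed; a rough sketch (note that Win32_Processor's CurrentClockSpeed is not refreshed in real time on every platform, so treat it as an approximation):

    # Poll the physical CPUs' reported clock every 2 seconds while the guest burns CPU.
    while ($true) {
        Get-CimInstance Win32_Processor |
            Select-Object DeviceID, CurrentClockSpeed, MaxClockSpeed,
                @{ n = 'PctOfMax'; e = { [int](100 * $_.CurrentClockSpeed / $_.MaxClockSpeed) } }
        Start-Sleep -Seconds 2
    }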
Obviously this is unwanted behavior. Think of a database server that needs three times as long to answer a query because the host doesn't have "enough" load to step up the CPU frequency. That doesn't make any sense.
I found a blog post from 2011 describing the exact same behavior for VMware and Cisco blades. I haven't found information on this anywhere else.
I was actually able to get rid of this behavior by switching to the Windows "High performance" power plan in powercfg.cpl, at the cost of around 30% higher power usage. I now get better and more consistent performance, and Performance Monitor shows lower load figures.
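The same switch can be scripted; a minimal sketch using the built-in powercfg aliases (SCHEME_MIN confusingly means "minimum power saving", i.e. the High performance plan):

    # Activate the "High performance" plan from the command line and confirm it took effect.
    powercfg /setactive SCHEME_MIN
    powercfg /getactivescheme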
(On an older server, I found an additional setting, "Processor power management | Minimum processor state", which could be set to 100% without disabling all the other power-saving options. The newer servers only show "System cooling policy", which is set to "Active" even for the "Balanced" plan, so my only option was to choose "High performance".)
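On hosts where the GUI hides "Minimum processor state", it may still be settable from the command line. This is only a sketch using the aliases listed by powercfg /aliases; whether the change actually takes effect depends on whether the BIOS lets the OS control the P-states:

    # Set "Minimum processor state" (on AC power) to 100% in the currently active plan.
    powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMIN 100
    powercfg /setactive SCHEME_CURRENT   # re-apply the scheme so the change takes effect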
Is this really best practice for Hyper-V hosts, or is there another workaround? If SpeedStep is really a problem, I wonder why it is even built into server CPUs and enabled by default, and why I have never read about this setting in a Hyper-V configuration guide.
After a bit more searching, it seems like this is a general problem with modern server CPUs, even unrelated to virtualization, and that major server vendors as well as software vendors like Microsoft and VMware ship their products with default settings that artificially limit CPU performance. I still find that hard to believe.
The solution for anybody who cares about having instant access to full CPU power per core, without all the cores having to be busy first, is to disable power saving (Intel SpeedStep/EIST or AMD Cool'n'Quiet). Depending on your BIOS settings, this can be controlled at the OS level (on Windows via the "High performance" plan in powercfg.cpl) or in the BIOS, in which case the OS setting is grayed out. Brent Ozar wrote about this in 2011 ("SQL Server on Power-Saving CPUs? Not So Fast.").
Microsoft addresses this in KB2207548: there is a hotfix available for Win2008R2, and a BIOS update is recommended. But since the issue is still present on Win2012R2, it seems there is no way around the second recommendation, the "High performance" plan.
An issue with similar symptoms is described in KB2534356, which also offers a hotfix for Win2008R2 only. So for me only the usual workaround (the "High performance" plan) applies, but it sounds like a fix could be possible in the future. (It works great on desktop CPUs, so I don't understand why it shouldn't be possible on a server.)
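Before changing anything on a given host, I check what is actually in effect with a few read-only powercfg queries (output wording varies with the OS language):

    powercfg /getactivescheme                      # which power plan is active
    powercfg /query SCHEME_CURRENT SUB_PROCESSOR   # current "Processor power management" values
    powercfg /aliases                              # list the alias names used above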
I will update this answer if I find a better solution (or, of course, change the accepted answer if someone else posts one).
Still wondering whether EC2 or Azure might have the same issue (in that case you wouldn't be able to do anything about it, since you need control over the host; changing the setting inside the VM won't have any effect).
I've only seen this sporadically. In theory, SpeedStep (which increasingly seems to be a non-configurable option) should not affect your performance. However, when the clock is stepped down and a single VM suddenly gets busy, sometimes the processor just doesn't seem to consider that enough load to step back up. I'm not sure this is a Microsoft issue, since, as you mentioned, VMware and Cisco have the same issue.
Disabling SpeedStep is a BIOS feature on servers. On IBM blades it is a default BIOS option: no SpeedStep and sometimes no Turbo. Check this blog for the technical details:
https://workinghardinit.wordpress.com/tag/c-states/
"Depending on your findings and needs you might just want turn SpeedStep or Cool’n’Quiet off either in the BIOS or in windows" So if you have problems just turn it off.
Also check in the BIOS that the virtualization instructions (Intel VT-x / AMD-V) have not been turned off; some BIOSes ship with them disabled.
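A quick way to verify that from a running Windows install, without rebooting into the BIOS, is something like this (these Win32_Processor properties exist from Windows 8 / Server 2012 on):

    # Check from Windows that the firmware has virtualization (and SLAT) support enabled.
    Get-CimInstance Win32_Processor |
        Select-Object Name, VirtualizationFirmwareEnabled, VMMonitorModeExtensions,
            SecondLevelAddressTranslationExtensions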
Don't forget to install the Hyper-V Integration Services in older guest OSes so that the guest is "virtualization-aware".