We are having a performance issue with our VMware environment that we cannot explain, and I am hoping someone here may be able to help.

We have a web application that uses a database backend. We have a SQL Server 2005 cluster on Windows 2003 R2 between a physical node and a virtual node. Both physical servers are identical 2950s with 2x Xeon X5460 quad-core CPUs and 64GB of memory, 16GB allocated to the OS. We are using an iSCSI SAN for all cluster disks.

The problem is this: under repeated stress testing of the application while adding CPUs to the cluster nodes, the physical node scales from 1 pCPU to 8 pCPUs, meaning we see continued performance increases. When testing the node running vSphere, we see the expected ~12% performance hit for being virtual, and we still scale from 1 vCPU to 4 vCPUs like the physical node, but beyond that performance drops off; by the time we get to 8 vCPUs we are seeing performance numbers worse than at 4 vCPUs.

Again, both nodes are configured identically in terms of hardware, guest OS, SQL configuration, etc., and there is no traffic on the system other than the testing. There are no other VMs on the virtual host, so there should be no competition for resources.

We have contacted VMware for help, but they have not really offered any, suggesting things like setting SQL processor affinity, which, while helpful, would have the same net effect on each box and should not change our results in the least. We have looked at all of VMware's SQL tuning guides for vSphere with no benefit. Please help!
You've got a pretty neat setup here ;-)
Are the vCPUs used to their max capacity? What are the graphs for CPU wait, CPU ready and CPU usage telling you?
The more vCPUs you add to a VM, the more overhead the host incurs managing them and mapping them to physical cores. At some point you won't get more performance out of a VM simply by adding more vCPUs.
Did you check whether there is a performance issue on the iSCSI SAN? Check the graphs for disk read and write requests, and of course the disk read and write rates, and compare them to those of the physical cluster member.
Maybe some of those values can point you in the right direction.
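If you want to quantify the ready-time figures rather than eyeball the graphs, you can capture esxtop batch output (`esxtop -b -d 5 -n 60 > stats.csv` on the host) and summarize it with a short script. A rough sketch below; the column-name format is illustrative and varies by esxtop version, so match it to your actual CSV headers:

```python
import csv
import io
from statistics import mean

def avg_ready_pct(csv_text, vm_name):
    """Average the '% Ready' samples for one VM from esxtop batch output.

    esxtop -b writes one column per counter; sustained ready time above
    roughly 5% per vCPU is usually taken as a sign of CPU scheduling
    contention.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Pick out every '% Ready' column belonging to this VM.
    cols = [i for i, name in enumerate(header)
            if vm_name in name and "% Ready" in name]
    samples = [float(row[i]) for row in reader for i in cols]
    return mean(samples) if samples else 0.0

# Illustrative data only -- a real capture comes from esxtop -b on the host.
sample = (
    '"Time","\\\\esx1\\Group Cpu(123:sqlnode)\\% Ready"\n'
    '"12:00:00","4.10"\n'
    '"12:00:05","11.73"\n'
)
print(avg_ready_pct(sample, "sqlnode"))
```

Run the same capture at 4 and at 8 vCPUs; if the average jumps at 8, the scheduler, not SQL, is your bottleneck.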
I'm hesitant to enter this as an answer because I don't have much concrete support for it, but it could be the cause of the problem you're seeing. I've heard before (and this page sort of supports it) that VMware CPU scheduling has a harder time when a VM has multiple vCPUs. For a single-vCPU VM, only a single host processor needs to be scheduled. When a VM has more than one, however, VMware has to schedule multiple physical processors to be available to the VM at the same time, which can take longer. This becomes more and more difficult as the number of vCPUs increases, which means the VM actually sees worse performance because it has a harder time getting processor time allocated to it.
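To illustrate why this bites hardest when vCPUs equal pCPUs, here's a toy model (my numbers and assumptions, not VMware's actual scheduler, which uses relaxed rather than strict co-scheduling in recent versions): assume the VM can only be dispatched in an instant when all of its vCPUs can land on free cores, and each of the 8 cores is independently busy with some small probability due to hypervisor/interrupt overhead.

```python
from math import comb

def dispatch_probability(pcpus, vcpus, p_busy):
    """P(at least `vcpus` of `pcpus` cores are simultaneously free),
    with each core independently busy with probability p_busy.

    Under strict co-scheduling the VM only runs in such instants, so
    this probability bounds the CPU time it can actually receive.
    Free cores follow Binomial(pcpus, 1 - p_busy).
    """
    p_free = 1.0 - p_busy
    return sum(comb(pcpus, k) * p_free**k * p_busy**(pcpus - k)
               for k in range(vcpus, pcpus + 1))

# With even 10% background load per core, an 8-vCPU VM on 8 cores can
# co-schedule far less often than a 4-vCPU VM can.
for n in (1, 2, 4, 8):
    print(n, dispatch_probability(8, n, 0.1))
```

It's only a sketch, but it matches your symptom: scaling looks fine up to 4 vCPUs and falls off a cliff at 8, because at 8 vCPUs on an 8-core host any stolen core stalls the whole VM.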
Also, I re-read your last comment, and I've been having issues lately with contention in vSphere as well. One other thing you might want to try (if you haven't already) is increasing the resource allocation (CPU reservation and shares) for this VM, even though it's the only VM there at the moment.
Just to clarify: you're using an 8-core (2 x 4) ESX box to host a single 8-vCPU VM and seeing no real performance gain from the 5th and subsequent vCPUs, right? Can I ask why you're not using that host as a physical SQL box instead? You're using ~$5-6k of Enterprise Plus licences there for what appears to be no benefit (even if you weren't seeing performance issues) - I don't get it, sorry.