I'm a DBA and manage a VMware ESX 3.5 cluster that predominantly hosts SQL Servers and a few application servers. I have a question about how to set up the resource groups, because I'm in conflict with one of the ESX system admins about how to manage the resources.
The cluster (3 nodes, 32GB per node) currently hosts 33 guests configured to consume 77GB of RAM, although ESX is reporting that only 44GB is active. The cluster hosts live, test, development servers and a few other miscellaneous guests.
What I'd like to do is simplify the management of the servers' resources, and be able to manage and report on the performance of related servers.
For example, the resources consumed (RAM, disk, CPU) by the live SQL servers, the SharePoint servers, the CRM servers etc.
What I have done so far is create 4 "top level" resource groups.
1-High - For the most mission critical services (i.e. the live SQL server)
    32768 memory shares
2-Normal - For the majority of the remaining live systems (CRM, Sharepoint etc)
    16384 memory shares
3-Dev - Test and development systems
    8192 memory shares
4-Low - Unsupported servers (no SLA, temporary build servers etc)
    1024 memory shares
I have grouped the servers into their own "application" resource groups (SQL Live, SQL Test, CRM Live, CRM Test etc) but have not set any explicit resource limits on these groups.
And then I put the "application" groups into the appropriate "top level" resource group.
For example, each sub group has 4 guests, each with 1 CPU and 1GB RAM:
1-High (32768 shares)
    SQL Live - 4 guests
2-Normal (16384 shares)
    CRM Live - 4 guests
    Sharepoint Live - 4 guests
3-Dev (8192 shares)
    CRM Test - 4 guests
    SQL Test - 4 guests
    Sharepoint Test - 4 guests
4-Low (1024 shares)
    Remaining cruft - 4 guests
The sysadmin chap is telling me that "Sharepoint will only get 28% of 50% of the resources it needs!"
Before I reply to him, can I get some advice and a check on my assumptions:
- In normal operation the cluster is not overcommitting RAM (or CPU), so there are no resource limits being applied to any guest, either CPU or RAM.
- If one of the hosts fails, then there will only be 64GB of RAM available. As the remaining hosts restart the guests (we have HA and DRS enabled), this will overcommit the RAM.
- I want to ensure that the highest priority services maintain their service.
- I don't want to micromanage each individual guest!
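To make the numbers behind these assumptions concrete, here is a rough sketch of the arithmetic. It only uses the figures quoted above (3 nodes, 32GB each, 77GB configured, 44GB active); the function name is made up for illustration, and it ignores hypervisor overhead, so treat it as a back-of-envelope check rather than what ESX actually computes.

```python
# Figures taken from the cluster description above.
NODES = 3
RAM_PER_NODE_GB = 32
CONFIGURED_GB = 77   # RAM configured across all 33 guests
ACTIVE_GB = 44       # RAM that ESX reports as active

def capacity_gb(failed_hosts=0):
    """Raw RAM on the surviving hosts (assumption: no hypervisor overhead)."""
    return (NODES - failed_hosts) * RAM_PER_NODE_GB

for failed in (0, 1):
    cap = capacity_gb(failed)
    overcommit = max(0, CONFIGURED_GB - cap)  # configured RAM beyond capacity
    headroom = cap - ACTIVE_GB                # capacity left over active RAM
    print(f"{failed} host(s) down: capacity {cap}GB, "
          f"configured overcommit {overcommit}GB, active headroom {headroom}GB")
```

With one host down, the configured RAM exceeds the remaining capacity, while the active RAM still fits.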
What are your thoughts and experiences?
If I'm reading this correctly then you are correct about the normal operation of your environment, but I'm not sure that either of you is correct about how it works when contention arises.
When there is no contention (contention starts when resource utilization exceeds 80%, BTW), shares have no effect. So as far as normal operations in your environment are concerned, the resource groups will be cosmetic.
When there is contention, CPU resources will be constrained as your sysadmin has indicated, but that won't necessarily happen if you lose a host.
You don't say whether you have modified the shares on the child resource pools. I'm going to assume these are all set to Normal.
Assuming that there is contention, though, the way shares work is that each resource pool gets a proportion of the resources equal to its fraction of the total quantity of shares at that level. For your first level you have ~58k shares, so the High pool gets approx. 56%, Normal gets 28%, Dev gets 14% and Low gets 1.7%. Within each pool the sub-pools share the resources of that pool equally unless you have explicitly set different shares at that level. If you have, the same rules apply, but the total for the parent pool remains unaffected.
So in your case, when contention arises the live Sharepoint systems will get 50% of 28% of the contended resources, i.e. 14%.
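The proportions above can be reproduced with a few lines of Python. This is just the share arithmetic, not anything ESX exposes; the dictionary and function names are made up for illustration, and the share values are the ones from the pool definitions in the question (with Dev at 8192, which is what gives the ~58k total).

```python
# Memory shares for the four top-level pools, as listed in the question.
top_level_shares = {
    "1-High": 32768,
    "2-Normal": 16384,
    "3-Dev": 8192,
    "4-Low": 1024,
}

def share_fractions(shares):
    """Return each pool's fraction of the total shares at its level."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}

fractions = share_fractions(top_level_shares)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

These fractions only matter under contention, and they come out close to the rounded percentages quoted here.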
You can help things along somewhat by allocating reservations for the absolute minimum values of CPU and RAM that each system needs. The reserved values are guaranteed to the systems or resource pools you allocate them to and are not subject to allocation by shares. The key drawback is that if the values are too high, the cluster may be unable to even attempt to restart the VMs, because the resources cannot be guaranteed.
Also remember that even though your systems only consume ~44GB under normal operation, with Windows systems 100% of memory gets (briefly) allocated when a VM is started up. This can trigger a contention scenario for memory during a failover even though there is actually enough RAM for the systems once they are running. It's something to keep an eye on rather than worry too much about, but it can cause problems during HA restarts.
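As a very rough sanity check on reservations, you can compare their sum with the capacity that survives a host failure. The sketch below is deliberately simplified: the reservation figures are hypothetical placeholders (not your real values), the function name is made up, and real HA admission control works per host and is more involved than a single sum.

```python
# Hypothetical memory reservations in GB per resource pool (placeholders only).
reservations_gb = {
    "SQL Live": 16,
    "CRM Live": 8,
    "Sharepoint Live": 8,
}

def reservations_fit(reservations, capacity_gb):
    """True if the total reserved memory fits within the given capacity.

    Simplification: treats the cluster as one pool of RAM and ignores
    per-host placement and hypervisor overhead.
    """
    return sum(reservations.values()) <= capacity_gb

# 2 surviving hosts x 32GB, as in the question's failure scenario.
surviving_capacity_gb = 2 * 32
print(reservations_fit(reservations_gb, surviving_capacity_gb))
```

If a check like this fails for the one-host-down case, the reservations are probably too generous for HA restarts to succeed.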
Edited to add
If you've made no changes to the default share settings on individual VMs or child resource groups, then the proportion of resources allocated to individual VMs will not change when you move all VMs up a level in a structure where there is only a single child RG and place them directly in the parent. However, if there are multiple child RGs with different numbers of VMs in each, then this isn't true.
In your example, say we have your 4 Sharepoint VMs in their child RG and 2 CRM VMs in their child RG. The Sharepoint VMs get ~3.5% each (50% of 28% / 4) and the CRM VMs get 7% each (50% of 28% / 2). If you now move all of them up to the parent RG and delete the empty child RGs, you have 6 VMs sharing the 28% of resources available to the Normal RG, and each one will get ~4.7% (28% / 6).
Of course, if you change the shares on the child resource groups or individual VMs, this will all change.
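The nested versus flattened comparison works out as follows. This is a sketch of the arithmetic only, assuming every child RG has equal shares and every VM within a group has equal shares (as with identically sized guests on default settings); the function names are made up for illustration.

```python
NORMAL_POOL_FRACTION = 0.28  # approximate share of the Normal RG under contention

def per_vm_nested(pool_fraction, child_groups):
    """Per-VM fraction when VMs sit in equally weighted child groups.

    child_groups maps child group name -> number of VMs in that group.
    """
    per_child = pool_fraction / len(child_groups)
    return {name: per_child / vms for name, vms in child_groups.items()}

def per_vm_flat(pool_fraction, total_vms):
    """Per-VM fraction when all VMs sit directly in the parent group."""
    return pool_fraction / total_vms

children = {"Sharepoint Live": 4, "CRM Live": 2}
nested = per_vm_nested(NORMAL_POOL_FRACTION, children)
flat = per_vm_flat(NORMAL_POOL_FRACTION, sum(children.values()))

for name, fraction in nested.items():
    print(f"{name} (nested): {fraction:.1%} per VM")
print(f"All VMs (flat): {flat:.1%} per VM")
```

The point it illustrates is that child groups split the parent's allocation per group, not per VM, so a group with fewer VMs gives each of them a larger slice.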
Resource definitions only ever take effect in an overcommitted cluster.