First of all, the back story:
All of a sudden (literally overnight) an instance started throwing CPU utilization alerts. This is a rather lowly VM (1 vCPU, 2GB RAM), but all it does is very light NFS serving plus Cacti polling and serving for a handful of systems. The VM is hosted at an IaaS provider on vSphere 4.x and sits on enterprise kit (HP/NetApp SAN, etc.).
The last time I changed anything on this system was nearly four weeks ago. Looking over the metrics, I saw that one of the provider's agents, a McAfee process (cma), consumed WAY more RAM than usual until a cron job of mine restarted the service the weekend prior (the cron job is there because I'm convinced this agent has a memory leak; the entry is shown after the list below). Anyway, the problem is that I can no longer run Cacti (httpd/mysql/php, with a cron job that runs poller.php) on this system: the load climbs over 10 and iowait is really high (~90%). I've tried the following:
- ran Cacti with the McAfee service stopped
- systematically updated php*, httpd/mod_ssl, and mysql-server, trying to run Cacti after each update
- ran yum update to bring everything to the latest packages; the system is now RHEL 5.8 (x86_64)
The full yum update alone pushed the system over a load of 6 and took hours.
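For reference, here's the cron entry (as an /etc/cron.d file; the schedule and path are my own choices, and "cma" is the init script name the agent installed on this box, so adjust if yours differs):

# /etc/cron.d/restart-cma -- weekly bounce of the McAfee agent, as a workaround for the suspected leak
30 4 * * 0 root /sbin/service cma restart >/dev/null 2>&1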
I asked the hosting provider whether anything was wrong with the storage layer, and they said there wasn't. But this just doesn't compute. It got me wondering whether there could be a problem with partition misalignment, since I've read that it can cause exactly the kind of symptoms I seem to be experiencing. The provider would have created these VMFS partitions in the vSphere/vCenter client, which I understand ensures alignment. But can things get out of alignment over time? If so, is there any way to detect this from inside the VM/guest? NetApp's mbrscan utility looks like it can detect it, but it has to be run from the host's ESX console.
Thanks!
Edit: sfdisk output (with the -uS flags) added:
[root@nfs1 ~]# sfdisk -luS /dev/sda
Disk /dev/sda: 13054 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0
   Device Boot      Start        End   #sectors  Id  System
/dev/sda1   *          63     208844     208782  83  Linux
/dev/sda2          208845  164055779  163846935  83  Linux
/dev/sda3       164055780  209712509   45656730  8e  Linux LVM
/dev/sda4               0          -          0   0  Empty
Update:
A reboot of this instance completely solved the performance problems. Further analysis by the hosting provider did indicate that there is some misalignment, but in their opinion it would not cause the symptoms experienced; they say, for example, that the misalignment in Windows VMs is typically greater. At this point we're going to wait and see whether it happens again and, if so, change the sector offset.
The only way to see alignment issues from inside the guest is to examine the partition table in the master boot record. If you can read the partition start sectors from your VM, you can see whether you're misaligned.
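If you want to read the MBR directly, something like this works from inside any Linux guest (a sketch using standard coreutils): the partition table lives at byte 446 of sector 0, as four 16-byte entries, and bytes 8-11 of each entry are the little-endian start LBA.

$ dd if=/dev/sda bs=512 count=1 2>/dev/null | od -A d -t x1 -j 446 -N 64

A first partition starting at sector 63 shows up as "3f 00 00 00" at offset 8 of the first entry.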
That said, alignment problems magnify the number of IOs you send to the storage; the pain only appears when something in the stack penalizes that extra IO. NetApp is particularly hard hit because it starts limiting performance as soon as the number of "partial writes" that need extra attention from its back end hits a certain level. Other systems just treat each IO the same way as the last one, so they don't get the massive spike in storage latency that NetApp does.
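To make that concrete, a back-of-the-envelope check (assuming 4KB back-end blocks, as on WAFL): a 4KB guest write into a partition that starts at sector 63 lands at byte offset 63 x 512 = 32256, which is not a multiple of 4096, so it straddles two back-end blocks and becomes a read-modify-write on both:

$ start=$((63 * 512)); end=$((start + 4096 - 1))
$ echo "first 4K block: $((start / 4096)), last 4K block: $((end / 4096))"
first 4K block: 7, last 4K block: 8

Shift the partition start to sector 64 (or any multiple of 8) and the same write touches exactly one back-end block.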
You should be able to find out the guest alignment with sfdisk on Linux. Just look at the start sectors of your partitions. But that will only tell you half the story, since your provider can/should account for the default OS alignment at the storage layer.
So even if the guest looks misaligned (a start sector of 63, say), the storage may apply an offset into the LUN or datastore that corrects it to an aligned boundary. But at least you can take your new knowledge to your provider and have them confirm.
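A quick way to check all partitions at once from inside the guest (a sketch, assuming a 4KB block, i.e. 8-sector, boundary; the awk handles the optional boot-flag column):

$ sfdisk -luS /dev/sda | awk '$1 ~ /^\/dev\// { s = ($2 == "*") ? $3 : $2; printf "%s start=%s mod 8=%d\n", $1, s, s % 8 }'

A non-zero remainder means that partition does not start on a 4KB boundary.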
Update (for the new sfdisk results): None of your partitions start on a 4KB or 8KB block boundary, so it's quite likely that you are experiencing some misalignment pain. You need to ask your provider what block size the storage uses (e.g. 4KB) and what alignment correction they apply, if any. If they don't apply any correction, you want all of your partitions to start at a sector number evenly divisible by 8 (4KB) or 16 (8KB). While you are at it, an even 1MB start offset (a sector number evenly divisible by 2048) allows for any underlying storage block size changes in the future.
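Applied to your actual start sectors, none divide evenly by 8 (4KB), let alone by 2048 (1MB):

$ for s in 63 208845 164055780; do echo "$s: mod 8 = $((s % 8)), mod 2048 = $((s % 2048))"; done
63: mod 8 = 7, mod 2048 = 63
208845: mod 8 = 5, mod 2048 = 1997
164055780: mod 8 = 4, mod 2048 = 740

So either a repartition or a storage-side offset from your provider would be needed to line these up.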