My CPU I/O wait is steady at around 50%, but when I run iostat 1
it shows little to no disk activity.
What can cause I/O wait without any IOPS?
NOTE: There are no NFS or FUSE filesystems here, but it is using Xen virtualization.
NFS can do this, and it wouldn't surprise me if other network filesystems (and even FUSE-based devices) had similar effects.
Is there any chance other VMs on the server are thrashing the disk?
I know that with virtualisation you can get some strange results if the host node is overloaded.
If this is the Amazon EC2 Xen environment using instance-based storage, ask Amazon to check the health of the host containing this image.
If this is a Xen environment where you can gain access to the hypervisor, then check the I/O wait from outside for the disk image (file, network device, LVM slice, whatever) backing the xvda and xvdb devices. You'll also want to check the hypervisor's I/O system in general, since other disk devices might be monopolizing the system's resources.
iostat with a 5-second interval is usually a good starting diagnostic tool. It gives 5-second summaries of I/O for ALL devices available to it, and is thus useful both inside and outside the VM image.
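For example, something like this (assuming the sysstat package is installed; -x adds extended per-device statistics):

    # print extended I/O statistics every 5 seconds, for all block devices
    iostat -x 5
    # columns worth watching: %util (device saturation), await (average latency in ms),
    # and the queue length column (avgqu-sz, or aqu-sz on newer sysstat versions)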
Check your available file descriptors / inodes. When you hit the limit, the system can start to swap and mimic iowait.
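A quick way to check those, for example:

    # system-wide file descriptor usage: allocated, unused, max
    cat /proc/sys/fs/file-nr
    # per-filesystem inode usage
    df -i
    # per-process descriptor limit for the current shell
    ulimit -n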
Edit
I saw you are using Xen; have a look at your current interrupts, and you might find blkif is higher than normal.
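One way to watch for that, for example:

    # watch the interrupt counters; look for blkif / xen lines climbing unusually fast
    watch -n1 'grep -iE "blkif|xen" /proc/interrupts'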
A bit late now, but get Munin installed and it will really help future debugging.
Then check dmesg to see what is performing block reads/writes or dirtying inodes.
Also check the nofile limit in limits.conf; a process could be requesting more files than it is permitted to open.
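To check a specific process against that limit, something like the following works (PID 1234 is just a placeholder):

    # soft and hard open-file limits for the process
    grep 'Max open files' /proc/1234/limits
    # how many descriptors it actually has open right now
    ls /proc/1234/fd | wc -l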
WARNING: HDPARM IS DANGEROUS, ALWAYS READ ABOUT THE COMMAND YOU ARE GOING TO USE!
If no other virtual machines are stressing the hard disk(s), do an hdparm cache flush
on the underlying physical disk(s). Possibly the disk cache is not working correctly. The flush discards the data held in the cache, and you can then keep monitoring the I/O wait to see whether it rises again afterwards. If it does, it is probably a cache problem.
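For example, assuming the underlying physical disk is /dev/sda (hdparm's -f option syncs and flushes the kernel's buffer cache for the device; read man hdparm first, per the warning above):

    # sync and flush the buffer cache for the device (run as root)
    hdparm -f /dev/sda
    # then keep an eye on I/O wait (the "wa" column) to see if it climbs again
    vmstat 5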
I've seen blocked networking operations (i.e. long calls to an external DB server) increase the load average. I don't know for sure, but I'm guessing network I/O can cause CPU wait to go up? Can anyone confirm?
Could be loopback devices that are themselves mounted over the network.
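You can list the loop devices and the files backing them with, for example:

    # show every active loop device and its backing file
    losetup -a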
On my machines NFS is the biggest I/O-wait "producer". I have an SSD in my laptop which is fast as hell, so "real" I/O is not the problem. Nevertheless I sometimes get lots of I/O wait because of my mounted NFS shares.
SCP also sometimes seems to lead to I/O wait, but to a far lesser extent.
This can be anything. It just means that something is waiting for an I/O operation to finish. You can figure out which process it is via ps, then attach gdb to it and check the backtrace to determine which call is hanging (usually it is some network-related call or a suddenly disconnected disk). For file descriptor info, check /proc.
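A rough sketch of that procedure, assuming the stuck process turns out to have PID 1234 (a placeholder):

    # find processes stuck in uninterruptible sleep (state D), which is what drives iowait
    ps -eo state,pid,wchan:32,cmd | awk '$1 ~ /^D/'
    # grab a backtrace without an interactive session
    # (note: attaching may itself block while the process is stuck in D state)
    gdb -p 1234 -batch -ex 'bt'
    # see which files/sockets it has open, and where the kernel says it is blocked
    ls -l /proc/1234/fd
    cat /proc/1234/wchan; echo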