Running top to check the I/O wait, I get these figures:
Cpu(s): 6.7%us, 1.4%sy, 1.2%ni, 85.5%id, 5.0%wa, 0.0%hi, 0.3%si, 0.0%st
Looking at these figures (%us ~= %wa), do they mean that:
- there are almost as many CPU processes waiting as working? (=> bad)
- the working processes are waiting 5.0% of their execution time? (=> OK in this case)
- something else?
You need to be careful when evaluating these figures.
IOWait in this context is the measure of time, over a given period, that a CPU (or all CPUs) spent idle because all runnable tasks were waiting for an I/O operation to be fulfilled.
In your example, if you have 20 CPUs and one task really hammering the disk, that task is (in effect) spending 100% of its time in IOWait, so the CPU it runs on spends almost 100% of its time in IOWait. However, if the 19 other CPUs are effectively idle and not using this disk, they report 0% IOWait. This averages out to 5% IOWait overall, when in fact a peek at your disk utilization could report 100%. If the application waiting on disk is critical to you, this 5% is misleading: the bottlenecked task is likely seeing far worse performance than "going 5% slow" suggests.
Also remember that, for the most part, CPUs run tasks and tasks are what request I/O. If two separate tasks are busy querying the same disk on two separate CPUs, both CPUs will sit at 100% IOWait (and, in the 20-CPU example, the overall average will be 10% IOWait).
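The averaging is easier to see through if you look at per-CPU figures instead of the machine-wide aggregate. A minimal sketch, assuming the sysstat package is installed (pressing 1 inside top gives a similar per-CPU view):

    # per-CPU statistics, one-second interval, five samples
    mpstat -P ALL 1 5
    # a single core sitting near 100 in the %iowait column stands out here,
    # instead of being diluted to 5% in the machine-wide average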
Basically, if you have a lot of tasks that request I/O, especially from the same disk, and that disk is 100% utilized (see
iostat -mtx
, sketched below), then this is bad.
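A minimal sketch of reading that output (nothing beyond the sysstat package assumed):

    # extended device statistics in MB, with timestamps, every second for five samples
    iostat -mtx 1 5
    # the %util column is the disk utilization mentioned above: a device pinned
    # near 100% is saturated; await (or r_await/w_await on newer versions)
    # shows how long requests are queuing on it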
No. The working processes are almost certainly waiting full-time for I/O. It's just that the percentage is fudged by the averaging ("the other CPUs are not busy") or by the fact that the CPU has many tasks to run, many of which don't need to do I/O.
As a general rule, on a multi-CPU system, an IOWait percentage equal to 100 divided by the number of CPUs you have (e.g. 100 / 20 CPUs = 5%, the share a single fully-waiting CPU contributes) is probably something to investigate.
See above. But note that applications doing very heavy writing get throttled (they stop using writeback and start writing directly to disk). This causes those tasks to show high IOWait while other tasks on the same CPU writing to the same disk do not. So exceptions do exist.
Also note that if you have one CPU dedicated to running two tasks, one a heavy I/O reader/writer and the other a heavy CPU user, that CPU will report 50% IOWait. If you have ten tasks like this, it would be 10% IOWait (and a horrific load), so the reported number can be much lower than the actual problem.
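If you want to reproduce this kind of behaviour yourself, a rough sketch (the scratch path /tmp/iowait-test is just an example; note that /tmp is memory-backed on some distributions, so pick a path on a real disk):

    # generate ~1 GB of direct (uncached) writes to force real disk I/O
    dd if=/dev/zero of=/tmp/iowait-test bs=1M count=1024 oflag=direct
    # meanwhile, in another terminal, watch per-CPU %wa in top (press 1)
    # or the device columns in iostat -mtx 1
    rm /tmp/iowait-test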
I think you really need to take a look at
iostat -mtx
to get some disk utilization metrics, and
pidstat -d
to get some per-process metrics, then consider whether the applications hitting those disks in that way are likely to cause a problem, or whether other potential applications hitting those disks are likely to cause one.

CPU metrics really act as indicators of underlying issues; they are general, so understanding where they may be too general is a good thing.
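For the per-process side, a minimal sketch (pidstat also comes from the sysstat package):

    # per-process disk I/O, one-second interval, five samples
    pidstat -d 1 5
    # kB_rd/s and kB_wr/s show which processes generate the traffic;
    # on recent sysstat versions the iodelay column shows time spent
    # blocked waiting for I/O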
It means that 5% of CPU time is spent waiting for disk I/O to finish, and 6.7% of CPU time is spent actually doing the processing required by userland processes.
Check vmstat output, e.g.
vmstat 1 30
As long as the process count in column b does not pile up, you're good. Column b indicates the number of processes in uninterruptible sleep (D state), which are blocked until a disk I/O operation finishes.
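A sketch of where column b sits in that output (header as printed by procps vmstat; your version may differ slightly):

    vmstat 1 30
    # procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    #  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
    # b is the second column (tasks in uninterruptible D-state sleep);
    # wa, near the right, is the same iowait percentage top reports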
So, to answer your questions:
No. The time is roughly the same, but this is not necessarily a problem. As long as processes don't start piling up in D state, you are good. Improvements may include adding more RAM, to give the page cache (disk cache) more room and satisfy more reads from memory instead of disk, or tuning the disk scheduler (sketched below).
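For the scheduler-tuning part, a hedged sketch (sda is just an example device; which schedulers are available depends on your kernel):

    # show the available I/O schedulers; the active one is in brackets
    cat /sys/block/sda/queue/scheduler
    # switch to another one, e.g. deadline, as root
    echo deadline > /sys/block/sda/queue/scheduler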
This is the portion of CPU time spent handling userland processes; there is nothing to be worried about here, especially with so much idle (85.5%id) CPU time.

The wait state is when a process that is otherwise runnable is stopped waiting for I/O. It's a sign of contention, usually for disk resources.
It does mean that some of your processes aren't running as fast as they could, but that's pretty normal.
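If you want to see which processes are sitting in that wait state at a given moment, a small sketch using plain ps:

    # list tasks currently in uninterruptible (D-state) sleep
    ps -eo state=,pid=,comm= | awk '$1 ~ /^D/'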