We have a busy server that seems to be choking under a heavy I/O load; at least, that's the feeling I have. Output from iostat -xz looks like this:
                        extended device statistics
    device    r/s    w/s     kr/s    kw/s  wait  actv  svc_t  %w   %b
    sd5     224.8  157.8  10701.8  6114.7   0.0   9.5   24.7   0  100
    sd5     243.2  110.4  11565.3  4065.0   0.0   9.7   27.5   0  100
It's obvious that the disk subsystem is overloaded, since a 25 ms service time is unacceptable for a 6-drive SATA array, and 100% busy also means we're choked on disk I/O.
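As a quick sanity check, Little's law ties these columns together: the average number of in-flight requests (actv) should be roughly total IOPS times the service time. A minimal sketch in Python, using the first sd5 sample above:

    # Little's law check on the first sd5 sample above.
    reads_per_s = 224.8
    writes_per_s = 157.8
    svc_t_ms = 24.7

    iops = reads_per_s + writes_per_s      # total operations per second
    actv = iops * (svc_t_ms / 1000.0)      # expected in-flight requests
    print(f"predicted actv = {actv:.1f}")  # ~9.5, matching the reported column

So the numbers are at least internally consistent: about 9.5 requests are in flight at any moment.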
But why is wait always 0.0? And why is %w also 0? (%w sometimes goes to 1, then quickly returns to 0.)
Doesn't this mean that no process is waiting for I/O?
Does the RAID controller somehow cause this result or mask the wait times?
Can someone explain this behavior?
The svc_t column measures, in milliseconds, the "round trip": from the "bottom" of the operating system, through the disk subsystem, and back to the "bottom" of the operating system.
It is not completely correct that "100% busy means we're choked on disk I/O". It means the disk was busy doing something 100% of the time, not necessarily that it cannot do more than that, nor that it fails to serve requests in time (a subtle difference).
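To illustrate (a minimal sketch with made-up queue-depth samples, not data from your system): %b only reports the fraction of time at least one request was in flight, so a lightly loaded device and a deeply queued one can both show 100.

    # Hypothetical sampled queue depths, for illustration only.
    def percent_busy(in_flight_samples):
        """Fraction of samples where the device had >= 1 request in flight."""
        busy = sum(1 for n in in_flight_samples if n > 0)
        return 100.0 * busy / len(in_flight_samples)

    light = [1] * 100    # one request always in flight: barely loaded
    heavy = [10] * 100   # ten requests always in flight: much more loaded

    print(percent_busy(light), percent_busy(heavy))  # both print 100.0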
Usually the symptoms of overloaded disks are high values in the %w column and in actv (steadily over 200).
Could it be a latency problem? Does the system issue lots of random operations, so that the controller spends its time seeking for each data chunk across the six drives?
Yes, I think you're correct that the RAID controller is messing up the numbers. If it tells the driver that an operation has started as soon as it is requested, the driver won't know the request is still waiting for the disk hardware inside the RAID controller. Can you pull stats off the RAID controller directly?
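A toy model of that masking effect (purely illustrative assumptions, not the actual driver accounting): if the controller accepts every request the moment it arrives, the host-side wait queue that iostat reports never grows, while the real backlog accumulates inside the controller.

    import random

    # Toy model: the controller "acks" every request on arrival, so the
    # host-side wait queue (what iostat's wait/%w reflect) stays empty,
    # while requests pile up in the controller's internal queue.
    random.seed(1)
    DISK_CONCURRENCY = 6     # assumed: ops the 6-drive array completes per tick
    host_wait_queue = 0      # never grows: controller accepts instantly
    controller_queue = 0     # the hidden backlog

    for tick in range(1000):
        arrivals = random.randint(0, 14)   # bursty incoming I/O, mean 7/tick
        controller_queue += arrivals       # accepted immediately
        controller_queue -= min(controller_queue, DISK_CONCURRENCY)

    print("wait queue iostat sees:", host_wait_queue)         # 0
    print("backlog hidden in controller:", controller_queue)  # large

In this model iostat would report wait = 0.0 and %w = 0 even though requests spend most of their time queued, which matches what you're seeing.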