We have an Ubuntu 10.04 VPS serving a Rails site which often shows pretty high load, but doesn't have high CPU or memory numbers. Reading a lot of other questions here on Server Fault suggests to me that this is an I/O issue (i.e. there are processes stuck in the I/O wait state and therefore driving up load). I'm trying to track down those processes, but not having much luck. I'd appreciate help with (a) ways to identify the guilty processes, and/or (b) confirmation that I'm asking the right question.
Here's a snapshot of top:
    top - 18:28:49 up 5 days, 3:07, 2 users, load average: 1.79, 1.83, 1.73
    Tasks: 82 total, 1 running, 81 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st
    Mem: 1794980k total, 1780384k used, 14596k free, 13356k buffers
    Swap: 524284k total, 3116k used, 521168k free, 1012272k cached
Notice low swap, CPUs mostly idle; that's why I think we're I/O bound instead of memory or CPU bound.
Here's iostat (I've obfuscated the server name):
    $ iostat -x 1 3
    Linux 2.6.35.2-xenU (our.server.com)  03/25/11  _x86_64_  (2 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               1.75    0.19    0.50    0.31    0.01   97.24

    Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
    xvdap1     0.01   11.52  2.19  3.18  145.12  117.55    48.97     0.08  15.60   1.67   0.90
    xvdap9     0.01    0.01  0.00  0.00    0.10    0.14    62.62     0.00  13.20   6.09   0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.00    0.00    0.00  100.00

    Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
    xvdap1     0.00    0.00  0.00  0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
    xvdap9     0.00    0.00  0.00  0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.00    0.00    0.00  100.00

    Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
    xvdap1     0.00    0.00  0.00  0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
    xvdap9     0.00    0.00  0.00  0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
iotop won't run on this box:
    $ iotop
    Could not run iotop as some of the requirements are not met:
    - Linux >= 2.6.20 with I/O accounting support (CONFIG_TASKSTATS, CONFIG_TASK_DELAY_ACCT, CONFIG_TASK_IO_ACCOUNTING): Not found
    - Python >= 2.5 or Python 2.4 with the ctypes module: Found
ps seldom finds any processes in the D state:
    $ sudo ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
    976 root D [kjournald]
    $ sudo ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
    $ sudo ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
    $
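A single ps snapshot only shows what happens to be in the D state at that instant, so one option is to sample repeatedly and tally what turns up. The following is a minimal sketch, assuming a POSIX shell with procps ps and awk; the function names `tally_states` and `sample_states` and the default sample count and interval are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any standard tool).

# tally_states: read "STATE COMMAND" lines on stdin and print how often each
# (state, command) pair was seen, keeping only R (running/runnable) and
# D (uninterruptible sleep), the states that feed the Linux load average.
# Only the first word of the command name is used.
tally_states() {
    awk '$1 ~ /^[RD]/ { seen[substr($1, 1, 1) " " $2]++ }
         END { for (k in seen) print seen[k], k }' | sort -k1,1nr -k2
}

# sample_states [COUNT] [INTERVAL]: take COUNT snapshots (default 60),
# INTERVAL seconds apart (default 1), and tally the R/D processes seen.
sample_states() {
    n=${1:-60}
    interval=${2:-1}
    i=0
    while [ "$i" -lt "$n" ]; do
        # "state=" and "comm=" request the columns with empty headers.
        ps -e -o state= -o comm=
        sleep "$interval"
        i=$((i + 1))
    done | tally_states
}
```

Calling `sample_states 120 0.5` would, on systems whose sleep accepts fractions, take 120 snapshots half a second apart. The ps process itself will always show up as R, so that entry can be ignored.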
What's my next troubleshooting step?
ETA: I ran vmstat:
    $ vmstat
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     0  0   3116 509372  22880 773232    0    0    18    15   24   14  2  0 97  0
That wa value of 0 makes me wonder if I/O is really the problem.
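One caveat: when vmstat is run without an interval, its single report gives averages since the last reboot, so a live sample such as `vmstat 1 10` may tell a different story. As a rough sketch, assuming awk is available, the helper below pulls out just the r, b and wa columns from such a run; the name `pick_vmstat_cols` is hypothetical.

```shell
#!/bin/sh
# pick_vmstat_cols: read `vmstat INTERVAL COUNT` output on stdin and print
# the r (runnable), b (blocked) and wa (I/O wait %) columns of each sample.
# Columns are located by header name because their positions differ between
# vmstat versions. The first data row is skipped: it holds averages since
# boot rather than a live sample.
pick_vmstat_cols() {
    awk '
        # Header row: remember where each named column sits.
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip banner lines and anything seen before a usable header.
        $1 !~ /^[0-9]+$/ || !("wa" in col) { next }
        # Skip the first data row (since-boot averages).
        ++rows == 1 { next }
        { print $(col["r"]), $(col["b"]), $(col["wa"]) }
    '
}
```

Usage would look like `vmstat 1 10 | pick_vmstat_cols`. Samples with a non-zero b or wa value would point back toward I/O; all zeros would suggest looking elsewhere.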
Also, yes, I know load in the 1.x range isn't really a problem - but this app has a history of ramping up load until it chokes, and if I can track the source while it still has a low fever I might spare a fatality (to torture a metaphor).
I would recommend searching for anything not in the S (sleeping) state. It's possible you've got zombie processes which can get counted as something running, despite not really doing anything.

    ps -eo pid,user,state,cmd | awk '$3 !~ /S/ {print $0}'
This will show any non-sleeping processes (running, waiting on I/O, zombied, etc.).

It's worth noting that your load average isn't terribly alarming. Assuming you have more than two cores on the box, there's no doubt plenty of CPU power to go around. But it's obviously still worth looking into if you don't expect 1-2 processes to be running at any given time.
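To put the load figure in proportion to the core count, something like the following sketch could help. It assumes a Linux box with /proc/loadavg and a getconf that supports `_NPROCESSORS_ONLN`; the function names `load_per_core` and `report_load` are hypothetical.

```shell
#!/bin/sh
# load_per_core LOAD CORES: print LOAD divided by CORES to two decimals.
# Returns non-zero if CORES is not positive.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1
        printf "%.2f\n", load / cores
    }'
}

# report_load: read the 1-minute load average from /proc/loadavg and show
# it relative to the number of online cores.
report_load() {
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(getconf _NPROCESSORS_ONLN)
    echo "1-min load $load on $cores cores: $(load_per_core "$load" "$cores") per core"
}
```

As a rule of thumb, a per-core figure that stays below 1.0 means runnable work is not queuing for CPU, though it says nothing about what is generating the load.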
--Christopher Karel