I have a web site that uses lighttpd and CGI scripts.
After upgrading to Fedora 11 (ext4) the disc access became erratic.
The timing of python -c 'import cgi' varies between 0.1 and almost 10 seconds.
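A measurement like this could be collected with a small script along the following lines. This is only a sketch, assuming Python 3; the helper name time_command and its parameters are made up for illustration and are not part of the original setup.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=5, pause=0.0):
    """Run cmd `runs` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
        time.sleep(pause)
    return durations


if __name__ == "__main__":
    # Same measurement as in the question: start a fresh interpreter and import cgi.
    # The cgi module was removed in Python 3.13; substitute another module there.
    cmd = [sys.executable, "-c", "import cgi"]
    for seconds in time_command(cmd, runs=5, pause=0.2):
        print(f"{time.strftime('%H:%M:%S')}  {seconds:.3f}s")
```

Logging these numbers with a timestamp makes it possible to line up the slow runs with whatever else was happening on the machine at the time.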
How can I diagnose the problem? (Tools, methods, best practices ...)
Update Jul 30, 2009:
Found out that several CGI processes were hogging the drive. After killing them the graph is stable between 0.02 and 0.03. Still didn't get an answer on how to diagnose such problems.
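One possible way to spot processes that are hogging the drive is to sample the per-process I/O counters that Linux exposes in /proc/<pid>/io and compare two snapshots. The sketch below assumes Python 3 and a kernel built with per-task I/O accounting; reading other users' processes generally requires root. The function names (parse_proc_io, snapshot, busiest) are hypothetical.

```python
import os
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value.strip())
    return counters


def snapshot(proc_root="/proc"):
    """Return {pid: counters} for every process whose io file is readable."""
    result = {}
    if not os.path.isdir(proc_root):
        return result  # not a Linux-style /proc; nothing to sample
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as fh:
                result[int(entry)] = parse_proc_io(fh.read())
        except OSError:
            # The process exited, or we lack permission to read its counters.
            continue
    return result


def busiest(before, after, top=5):
    """Rank pids by bytes read plus bytes written between two snapshots."""
    deltas = []
    for pid, now in after.items():
        old = before.get(pid)
        if old is None:
            continue  # process started after the first snapshot
        moved = (now.get("read_bytes", 0) - old.get("read_bytes", 0)
                 + now.get("write_bytes", 0) - old.get("write_bytes", 0))
        if moved > 0:
            deltas.append((moved, pid))
    deltas.sort(reverse=True)
    return deltas[:top]


if __name__ == "__main__":
    first = snapshot()
    time.sleep(1)
    for moved, pid in busiest(first, snapshot()):
        print(f"pid {pid}: {moved} bytes read+written in the last second")
```

Processes that stay at the top of this list across several samples are the likely candidates for a closer look.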
If it is a fresh install, then tools like makewhatis (which is used by apropos and whatis) might cause the disk to be heavily used. Wait a few days for things to stabilize (updatedb, prelink, makewhatis, etc.); the timings may then become consistent.
It would also depend on what else you are doing on the server and on what the CGI script is actually doing: where it takes its input from, the size of the input, etc.
Also, if the disk is very old, use diagnostic tools (like Seagate SeaTools) to look for controller or bad-sector problems. The tools will also let you optionally repair the sector if the drive is actually from Seagate.
Do you really need/want ext4 on a production server? It's still a mighty bit too green for my taste for a server.
The only way to diagnose a problem like this is with lots and lots of data. Familiarize yourself with vmstat and iostat. A tool I recently learned about in this thread is dstat, which effectively combines the two. For problems like the one you're describing, this command would likely be useful:
It will report on CPU, IO (disk and net), interrupts, swap, and load average. As a nice little bonus, it will include the name of whatever process was "most expensive" at the time the snapshot was taken. Unfortunately that particular command produces output too wide to paste here, so here's a slightly more conservative version: