I'm having a problem with a Linux system, and I have found sysstat and sar reporting huge peaks of disk I/O, average service time and average wait time. How could I determine which process is causing these peaks the next time it happens? Is it possible to do this with sar? Can I find this information in the already recorded sar files?
Output of sar -d; the system stall happened around 12:58-13:01:
12:40:01 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:40:01 dev8-0 11.57 0.11 710.08 61.36 0.01 0.97 0.37 0.43
12:45:01 dev8-0 13.36 0.00 972.93 72.82 0.01 1.00 0.32 0.43
12:50:01 dev8-0 13.55 0.03 616.56 45.49 0.01 0.70 0.35 0.47
12:55:01 dev8-0 13.99 0.08 917.00 65.55 0.01 0.86 0.37 0.52
13:01:02 dev8-0 6.28 0.00 400.53 63.81 0.89 141.87 141.12 88.59
13:05:01 dev8-0 22.75 0.03 932.13 40.97 0.01 0.65 0.27 0.62
13:10:01 dev8-0 13.11 0.00 634.55 48.42 0.01 0.71 0.38 0.50
This is also a follow-up question to another thread I started yesterday.
If you are lucky enough to catch the next peak utilization period, you can study per-process I/O stats interactively, using iotop.
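For example, something like this (iotop needs root; -o limits the list to processes that are actually doing I/O at that moment):

$ sudo iotop -o
# live per-process read/write rates; the arrow keys change the sort column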
You can use pidstat to print cumulative I/O statistics per process every 20 seconds; the command, and the columns each row will have, are sketched below.
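A minimal sketch (the exact column set varies a little between sysstat versions):

$ pidstat -dl 20
# UID, PID   - owner and process ID
# kB_rd/s    - kilobytes the task has caused to be read from disk per second
# kB_wr/s    - kilobytes the task has caused, or shall cause, to be written per second
# kB_ccwr/s  - kilobytes whose write-out to disk has been cancelled by the task
# Command    - the full command line (because of -l)

Watch the kB_rd/s and kB_wr/s columns around the time of the next stall to see which command is responsible.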
Nothing beats ongoing monitoring; you simply cannot get time-sensitive data back after the event... There are a couple of things you might be able to check to implicate or eliminate, however:
/proc is your friend. Fields 10 and 11 of /proc/diskstats are the accumulated sectors written and the accumulated time (ms) spent writing; sorting on them, as sketched below, will show your hot file-system partitions.
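A sketch of that sort (field numbering as in the 2.6+ /proc/diskstats format):

$ sort -n -k 10 /proc/diskstats   # rank devices/partitions by sectors written
$ sort -n -k 11 /proc/diskstats   # rank them by milliseconds spent writing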
Fields 1, 2 and 42 of each /proc/[PID]/stat are the PID, command and cumulative block-I/O-wait ticks; see the sketch below. This will show your hot processes, though only if they are still running. (You probably want to ignore your filesystem journalling threads.)
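One way to pull those fields out (a sketch; note that a command name containing spaces will shift the field numbers on that line):

$ awk '{ print $1, $2, $42 }' /proc/[0-9]*/stat | sort -n -k 3
# PID, command, aggregated block-I/O delay in clock ticks; the biggest waiters sort to the bottom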
The usefulness of the above depends on uptime, the nature of your long running processes, and how your file systems are used.
Caveats: this does not apply to pre-2.6 kernels; check your documentation if unsure.
(Now go and do your future self a favour: install Munin/Nagios/Cacti/whatever ;-)
Use atop (http://www.atoptool.nl/). Write the data to a compressed file that atop can read later in an interactive style. Take a reading (delta) every 10 seconds and do it 1080 times (3 hours, so if you forget about it the output file won't run you out of disk). After the bad thing happens again, read the raw file back with atop; a sketch of both commands follows.
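A sketch, with /tmp/atop.raw as a made-up path for the raw file:

$ atop -w /tmp/atop.raw 10 1080 &
# record a snapshot every 10 seconds, 1080 samples, in the background
$ atop -r /tmp/atop.raw
# replay the recorded samples interactively after the stall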
(even if it is still running in the background, it just appends every 10 seconds)
Since you said I/O, I would hit 3 keys: t, d, D.
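(If I remember atop's key bindings correctly: d switches the process columns to disk-related ones, D sorts processes by disk activity, and t steps forward to the next sample when reading a raw file.)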
Use btrace. It's easy to use, for example btrace /dev/sda. If the command is not available, it is probably in the blktrace package.

EDIT: Since debugfs is not enabled in the kernel, you might try
date >>/tmp/wtf && ps -eo "cmd,pid,min_flt,maj_flt" >>/tmp/wtf
or similar. Logging page faults is of course not at all the same as using btrace, but if you are lucky, it MAY give you some hint about the most disk-hungry processes. I just tried that on one of my most I/O-intensive servers and the list included the processes I know are consuming lots of I/O.
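A sketch of one way to keep that logging running unattended (the 60-second interval is just a guess at something reasonable):

$ while true; do date >>/tmp/wtf; ps -eo "cmd,pid,min_flt,maj_flt" >>/tmp/wtf; sleep 60; done &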
Disk utilization by each process:

$ glances
# (with htop, the best tool to get an idea of what is going on; hit the right arrow key to sort processes by disk utilization)

$ sudo iotop -ao
# (-a accumulated; -o show only processes with activity)