I'm diagnosing a high CPU usage event, and I found a weird difference between the numbers from ps/vmstat, which show almost 0%, and sar/top, which show almost 100% (user + system):
sar 1 5
Linux 2.6.9-67.ELsmp (uxdfl712) 07/25/2020
01:48:31 PM CPU %user %nice %system %iowait %idle
01:48:32 PM all 43.83 0.00 56.17 0.00 0.00
01:48:33 PM all 42.68 0.00 57.32 0.00 0.00
01:48:34 PM all 42.57 0.00 57.43 0.00 0.00
01:48:35 PM all 43.18 0.00 56.82 0.00 0.00
Average: all 43.14 0.00 56.86 0.00 0.00
vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
32 0 0 10493612 233320 4485160 0 0 0 14 0 1 0 0 100 0
ps -e hao %cpu | awk '{ sum += $1 } END { print sum }'
0.2
top -bn 1 |
sed '1,/PID USER PR NI %CPU/d' |
awk '{ sum += $5 } END { print sum }'
398
I searched a lot on Stack Exchange and elsewhere, but all I could find were references to virtualization (this is a physical machine) and CPU load, which is not my issue. I also checked /proc/<PID>/stat, but found no hint there.
Why do these commands show different numbers? Are they actually querying different things? Or could the executables just be too old and buggy? (Please see the server data below; I'm honestly horrified at how outdated this is.)
Thanks!
uname -r
2.6.9-67.ELsmp
cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 6)
yum provides `which sar` | grep installed
sysstat.i386 5.0.5-16.rhel4 installed
yum provides `which vmstat` | grep installed
procps.i386 3.2.3-8.9 installed
yum provides `which ps`
<Too many providers>
ps -V
procps version 3.2.3
yum provides `which top` | grep installed
procps.i386 3.2.3-8.9 installed
grep -c processor /proc/cpuinfo
4
This is an intermittent, occasional load. The first line of vmstat gives averages since the last reboot, and over that time this host has apparently been mostly idle. When vmstat is run with an interval (for example vmstat 1 5), subsequent lines show data for the sampling period, which will be closer to what sar is reporting.
0% idle for an extended period of time is generally not good. But how bad running out of CPU is really depends on the system and applications.
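To make the since-boot versus sampling-period distinction concrete, here is a minimal sketch that computes both figures from the aggregate "cpu" line of /proc/stat. The function name busy_pct is made up for illustration, and treating idle plus iowait as "not busy" is an assumption of this sketch, not necessarily how vmstat or sar do their accounting.

```shell
#!/bin/sh
# busy_pct (hypothetical helper): percentage of non-idle CPU time between
# two aggregate "cpu" lines taken from /proc/stat.
# Assumption: idle + iowait count as "not busy"; everything else is busy.
busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            n = (NF < 9 ? NF : 9)   # stop before guest fields, which are already part of user/nice
            for (k = 2; k <= n; k++) total += $k
            idle = $5 + $6          # idle + iowait (a missing field counts as 0)
        }
        NR == 1 { t0 = total; i0 = idle }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (dt - (idle - i0)) / dt
            else print "0.0"
        }'
}

if [ -r /proc/stat ]; then
    # Since boot: the counters start at zero, so use an all-zero baseline.
    # This is the kind of figure the first vmstat line reports.
    echo "busy since boot:  $(busy_pct "cpu 0 0 0 0" "$(grep '^cpu ' /proc/stat)")%"

    # Last second only: two samples taken one second apart.
    # This is the kind of figure sar 1 and later vmstat lines report.
    a=$(grep '^cpu ' /proc/stat)
    sleep 1
    b=$(grep '^cpu ' /proc/stat)
    echo "busy last second: $(busy_pct "$a" "$b")%"
fi
```

On a host that is idle most of the time but busy right now, the first number should stay near 0 while the second approaches 100, which matches the gap between the single vmstat line and the sar output in the question.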
Evaluate how the applications are performing on this box. How is response time to user requests? Is it doing batch processing in time? If your performance expectations are not met, that is a reason to improve things.
In addition to hardware age, this is older software; RHEL 4 entered extended support 8 years ago. On a modern Linux, finding exactly what's on CPU is easy: install debug symbols and run perf top. And anything can be instrumented in detail. However, I don't remember how good the performance tools on RHEL 4 were.
Really, if this host is to continue to provide value, it should be upgraded, to get security updates again if nothing else.