I wrote a program (a monitoring plugin) that reads the CPU usage numbers from /proc/stat. Unfortunately the numbers do not seem to match the manual page proc(5); in particular I get different proportions on real machines and on Xen paravirtualized machines. The amount of idle time differs between these machines:
(An explanation of the "code blocks" following: the first part shows the fields of the "cpu" line (e.g. "cpu#1" is the first field of the cpu line), followed by a boot count ("epoch"), the UNIX time, and the actual value, each separated by a colon (:). The next line, starting with "stat OK", is the output of my monitoring plugin; here it outputs the raw differences for debugging purposes, but usually it would output difference rates. It also adds human-readable labels to the numbers. "time" is the time difference since the last call, in seconds. Finally I added the CPU-related lines from /proc/stat (with some time elapsed since the plugin output).)
First a physical machine with two 6-core CPUs, two threads per core: the idle time is about 900 times the sum of the other CPU states, corresponding to 99.89% idle and 0.06% user CPU. Also note that the ratio of idle time to elapsed time is about 2398.5; dividing that by USER_HZ (100) gives roughly the number of CPUs (24). It looks odd to me:
# physical two 6-core cpus, 2 threads each
cpu#1=0:1596547833:2667804
cpu#2=0:1596547833:90388
cpu#3=0:1596547833:1257514
cpu#4=0:1596547833:2735255340
cpu#5=0:1596547833:142707
cpu#6=0:1596547833:0
cpu#7=0:1596547833:107191
cpu#8=0:1596547833:0
cpu#9=0:1596547833:0
cpu#10=0:1596547833:0
stat OK: epoch=0, time=354, cpu.usr=581, cpu.ni=24, cpu.sys=288, cpu.idl=849070, cpu.iow=29, cpu.hirq=0, cpu.sirq=13, cpu.st=0, cpu.vgst=0, cpu.usr0=0
# cat /proc/stat
cpu 2668778 90430 1257998 2736664140 142741 0 107213 0 0 0
cpu0 116314 1436 53622 113861868 3864 0 81296 0 0 0
cpu1 142008 4782 32464 114026161 9767 0 4396 0 0 0
cpu2 167052 3058 63902 113932120 12818 0 1966 0 0 0
cpu3 120029 4260 28712 114058016 3337 0 1478 0 0 0
cpu4 145332 2972 61798 113983716 16115 0 1037 0 0 0
cpu5 114346 6809 27875 114060364 4110 0 1124 0 0 0
cpu6 126193 3720 54701 113999094 12348 0 968 0 0 0
cpu7 108188 4859 27436 114067537 6028 0 976 0 0 0
cpu8 121890 2820 51548 114020211 13474 0 940 0 0 0
cpu9 102942 4235 26150 114076765 3423 0 977 0 0 0
cpu10 125984 2724 48521 114014015 13950 0 845 0 0 0
cpu11 89154 4047 26674 114085160 7735 0 885 0 0 0
cpu12 116730 3894 397743 113663892 2352 0 884 0 0 0
cpu13 84306 4424 26164 114096015 2767 0 871 0 0 0
cpu14 127293 3539 44438 114033462 1294 0 922 0 0 0
cpu15 77740 3958 26201 114105245 358 0 854 0 0 0
cpu16 133217 3043 41476 114034324 737 0 958 0 0 0
cpu17 88893 4497 25736 114094645 662 0 838 0 0 0
cpu18 125887 2812 39150 114024555 1309 0 806 0 0 0
cpu19 65198 3560 25976 114092343 21838 0 802 0 0 0
cpu20 109361 3292 37270 114059144 1381 0 764 0 0 0
cpu21 71055 4094 26435 114111750 759 0 859 0 0 0
cpu22 118589 3643 37525 114052728 1567 0 883 0 0 0
cpu23 71069 3943 26468 114110998 737 0 875 0 0 0
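The ratios described above can be checked with a quick sketch, using the delta values from the "stat OK" line (USER_HZ = 100 is an assumption here; the portable way is sysconf(_SC_CLK_TCK)):

```python
# Tick deltas from the "stat OK" plugin line above
deltas = {"usr": 581, "ni": 24, "sys": 288, "idl": 849070,
          "iow": 29, "hirq": 0, "sirq": 13, "st": 0, "vgst": 0}
elapsed = 354   # seconds since the last plugin call ("time=354")
USER_HZ = 100   # assumed; query with sysconf(_SC_CLK_TCK)

busy = sum(v for k, v in deltas.items() if k != "idl")
print(deltas["idl"] / busy)                        # ~908: idle is ~900x the rest
print(100 * deltas["idl"] / sum(deltas.values()))  # ~99.89% idle
print(deltas["idl"] / elapsed / USER_HZ)           # ~23.99, i.e. about 24 CPUs
```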
Then a Xen paravirtualized machine with two virtual CPUs. The idle time is about 74 times the sum of the other CPU states, corresponding to 98.66% idle and 8% user CPU. Again, taking the ratio of idle time to elapsed time gives 197.4, roughly corresponding to 2 CPUs. And here is one problem: user CPU and idle add up to more than 100%.
## virtual 2 cpus (Xen PV)
cpu#1=0:1596547988:1162034
cpu#2=0:1596547988:227660
cpu#3=0:1596547988:3036855
cpu#4=0:1596547988:701649884
cpu#5=0:1596547988:1037577
cpu#6=0:1596547988:0
cpu#7=0:1596547988:31478
cpu#8=0:1596547988:355862
cpu#9=0:1596547988:0
cpu#10=0:1596547988:0
stat OK: epoch=0, time=36, cpu.usr=16, cpu.ni=7, cpu.sys=28, cpu.idl=7108, cpu.iow=4, cpu.hirq=0, cpu.sirq=0, cpu.st=5, cpu.vgst=0, cpu.usr0=0
> cat /proc/stat
cpu 1162136 227690 3037149 701727879 1037664 0 31481 355901 0 0
cpu0 531438 112727 1469157 350497090 791387 0 31011 192100 0 0
cpu1 630698 114962 1567991 351230788 246276 0 470 163801 0 0
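Assuming each CPU accumulates USER_HZ idle ticks per second when fully idle, the same idle-to-elapsed-time heuristic recovers the CPU count on both machines (a sketch for illustration, not part of the plugin):

```python
def cpus_from_idle(idle_delta_ticks, elapsed_seconds, user_hz=100):
    """Estimate the CPU count of a mostly idle box from its idle-tick delta."""
    return idle_delta_ticks / elapsed_seconds / user_hz

print(cpus_from_idle(849070, 354))  # physical box: ~23.99 (24 CPUs)
print(cpus_from_idle(7108, 36))     # Xen PV guest: ~1.97 (2 virtual CPUs)
```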
I know that the numbers in /proc/stat are in units of USER_HZ, but being a common factor, that shouldn't matter, right? I feel the idle proportion does not match the rest of the CPU states (too high?), but I fail to see what's wrong.
(I also realize that with multiple cores you can never read those numbers from /proc/stat fully consistently, but the differences should be small enough to ignore.)