Is there a reasonable/safe level of CPU load in the context of high performance computing?
I understand the meaning of load average for servers in general, but do not know what to expect for servers built and used for high performance computing.
Is the usual convention of load <= # of cores applicable in this environment?
I'm curious given my system-specific details, where typically load >> # of cores. For each node:
- 24 physical cores, hyperthreading for 48 virtual cores (relatively new hardware)
- load averages: typically 100-300
The nodes have high uptime and usually high CPU usage/load. There are very few hardware failures, especially for CPUs, but I don't know what to expect over the lifetime of the nodes given sustained high load.
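For reference, this is the kind of load-per-core number I am comparing against that convention; a minimal sketch that only assumes the standard Linux /proc/loadavg file:

# Minimal sketch: 1-minute load average per core on Linux.
import os

with open("/proc/loadavg") as f:
    load1 = float(f.read().split()[0])   # 1-minute load average

cores = os.cpu_count()                   # 48 virtual cores on these nodes
print(f"load1={load1:.2f}  cores={cores}  load/core={load1 / cores:.2f}")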
Example top output:
top - 14:12:53 up 4 days, 5:45, 1 user, load average: 313.33, 418.36, 522.87
Tasks: 501 total, 5 running, 496 sleeping, 0 stopped, 0 zombie
%Cpu(s): 33.5 us, 50.9 sy, 0.0 ni, 15.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19650371+total, 46456320 free, 43582952 used, 10646443+buff/cache
KiB Swap: 13421772+total, 78065520 free, 56152200 used. 15164291+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
85642 user 20 0 36.5g 7.6g 245376 S 1566 4.0 1063:21 python
97440 user 20 0 33.1g 5.3g 47460 S 1105 2.8 512:10.86 python
97297 user 20 0 31.0g 4.0g 69828 S 986.4 2.1 430:16.32 python
181854 user 20 0 19.3g 5.0g 19944 R 100.0 2.7 2823:09 python
...
Output of iostat -x 5 3 on the same server:
avg-cpu: %user %nice %system %iowait %steal %idle
50.48 0.00 12.06 0.38 0.00 37.08
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 350.41 705.68 58.12 22.24 2126.25 3393.61 137.36 6.02 74.93 9.10 246.94 1.19 9.56
dm-0 0.00 0.00 4.87 8.70 511.41 516.65 151.59 0.31 22.55 28.40 19.28 2.62 3.56
dm-1 0.00 0.00 403.67 719.23 1614.71 2876.92 8.00 8.83 7.10 7.38 6.95 0.08 9.05
dm-2 0.00 0.00 0.00 0.00 0.02 0.01 65.03 0.00 3.74 3.82 1.00 2.12 0.00
Load average shows the queue of threads ready to run. On Linux, this also includes threads in uninterruptible sleep, typically waiting on disk. A hung NFS server can push the load average to insane numbers without the CPUs being hogged at all.
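As a quick way to see which case you are in (a sketch of my own, not something taken from your nodes), count tasks by state from /proc; a load dominated by D-state tasks points at I/O or NFS rather than CPU:

# Sketch: count Linux tasks by state. 'R' is runnable, 'D' is
# uninterruptible sleep (usually disk or NFS I/O); both feed the load average.
import glob
from collections import Counter

states = Counter()
for path in glob.glob("/proc/[0-9]*/stat"):
    try:
        with open(path) as f:
            # The state field comes right after the parenthesised command name.
            states[f.read().rsplit(")", 1)[1].split()[0]] += 1
    except OSError:
        pass  # process exited while we were reading

print(states)  # e.g. Counter({'S': 480, 'R': 5, 'D': 2})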
So the load average shows just one side of the story and cannot be taken alone; that's why I was asking for the top output.
Some workloads are not parallelizable: all steps run on the same core, one after another. Real problems are usually only partially parallelizable.
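As a rough illustration (not a claim about your specific jobs), Amdahl's law bounds the speedup when only a fraction p of the work parallelizes; the p values below are hypothetical:

# Minimal sketch of Amdahl's law: best-case speedup on n cores when a
# fraction p of the work is parallelizable. The p values are hypothetical.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.5, 0.9, 0.99):
    print(f"p={p}: speedup on 24 cores = {amdahl_speedup(p, 24):.1f}x")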
In performance work you have goals and constraints: low latency, high throughput, cost (both the initial cost and the operational ones), and so on.
If you are interested in throughput and low cost, having a high queue can be normal: all your CPU cores will be at 100% usage all the time.
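A toy model (an assumption of mine, not a measurement from your nodes) shows the trade-off: once the cores are saturated, adding more runnable tasks stretches per-task latency without adding throughput.

# Toy model: effect of oversubscription on per-task latency once the
# CPUs are already saturated. WORK_SECONDS is a hypothetical value.
CORES = 24
WORK_SECONDS = 10.0  # CPU time one task needs

for runnable in (24, 100, 300):
    share = min(1.0, CORES / runnable)   # CPU fraction each task gets
    latency = WORK_SECONDS / share       # wall-clock time per task
    print(f"{runnable:4d} runnable tasks: ~{latency:6.1f}s per task, cores still ~100% busy")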
Load average is merely a symptom, a useful metric that is easy for the operating system to report. A doctor can't diagnose what's wrong with a human patient from a fever alone; they have a lot more questions about what's going on. Similarly with computer patients, you need a lot more context on how they are performing.
Load average can also vary quite a bit between systems. Some platforms do not count tasks waiting on I/O in the load average, which is different from how Linux does it. Some hosts can run with a load average per core in the dozens and not fall over. Some applications are very latency sensitive, and a load per core greater than one shows up as poor user response time.
In addition to OS-level metrics, collect application-specific performance benchmarks and trend them over time. Generic examples: jobs completed per hour, wall-clock time per job, transactions per second.
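A minimal sketch of trending one such metric (the job_unit function and the CSV path are placeholders of mine): time a unit of real work and append the sample to a file you can plot later.

import csv, time
from datetime import datetime, timezone

def job_unit():
    # Placeholder for one unit of real application work.
    sum(i * i for i in range(10**6))

start = time.perf_counter()
job_unit()
elapsed = time.perf_counter() - start

# Append a timestamped sample so the benchmark can be trended over time.
with open("app_benchmark.csv", "a", newline="") as f:
    csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), f"{elapsed:.3f}"])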
Getting some measure of the useful work a system does is necessary to put the OS metrics into context. Your system appears to be doing useful work even at a relatively high load average. Contrast that with a fork bomb, which drives load to unusably high levels but, being a denial-of-service attack, does nothing useful.