I have a pipeline that runs some computationally intensive tasks on a Linux machine. The script that launches these checks the current load average and, if it is above a certain threshold, waits until the load falls below it. This is on an Ubuntu virtual machine (running on an Ubuntu host, if that's relevant) which can have a variable number of cores assigned to it. Both our development and production machines are VMs running on the same physical server and we manually allocate cores to each as needed.
I have noticed that even when the VM has as few as 20 cores, a load of ~60 isn't bringing the machine to its knees. My understanding of how the Linux load average works was that anything above the number of CPUs is indicative of a problem, but apparently things are not quite as clear-cut as all that.
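For context, the gating logic in the launcher is essentially the following (a simplified sketch, not the actual script; the threshold and polling interval are placeholders):

```bash
#!/usr/bin/env bash
# Simplified sketch of the load gate: poll the 1-minute load average
# and only proceed once it drops below a hard-coded threshold.
THRESHOLD=60        # placeholder; this is the value I want to derive from the core count
POLL_SECONDS=30     # placeholder polling interval

while true; do
    # The first field of /proc/loadavg is the 1-minute load average.
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    # Compare as floating point via awk, since bash arithmetic is integer-only.
    if awk -v l="$load" -v t="$THRESHOLD" 'BEGIN { exit !(l < t) }'; then
        break
    fi
    sleep "$POLL_SECONDS"
done

# ...then launch the computationally intensive task here.
```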
I'm thinking of setting the threshold at something like `$(grep -c processor /proc/cpuinfo) x N` where `N >= 1`. Is there any clever way of determining the value `N` should take so as to both maximise performance and minimise lag?

In other words, how can I know what maximum load average a machine can support before performance starts to degrade? I had naively expected that to be the number of CPUs (so, `N=1`), but that doesn't seem to hold up. Since the number of cores can vary, testing possible combinations is both complicated and time-consuming and, since this is a machine used by various people, impractical.
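For concreteness, this is roughly what I'd like the threshold calculation to look like (a sketch only; the value of `N` is exactly what I don't know how to choose):

```bash
# Proposed threshold: number of cores multiplied by some factor N.
# N=1 would mean "at most one runnable task per core on average";
# the open question is what N should actually be.
N=3                                        # placeholder value
cores=$(grep -c processor /proc/cpuinfo)   # could also use: nproc
threshold=$(( cores * N ))
echo "Pausing new jobs whenever the 1-minute load average exceeds ${threshold}"
```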
So, how can I decide on an acceptable maximum load-average threshold as a function of the number of available cores?