On our cluster we sometimes had nodes go down when a new process requested too much memory. I was puzzled as to why the OOM killer does not simply kill the guilty process.
The reason turned out to be that some processes get an oom_adj of -17, which makes them off-limits to the OOM killer (unkillable!).
I can clearly see that with the following script:
#!/bin/bash
# list every process whose oom_adj is not the default 0 (skipping /proc/self)
for i in `grep -v '^0$' /proc/*/oom_adj | awk -F/ '{print $3}' | grep -v self`; do
  ps -p $i | grep -v CMD    # show the ps line without the header
done
OK, it makes sense for sshd, udevd, and dhclient, but then I see regular user processes getting -17 as well. Once such a user process causes an OOM event it will never be killed, and the OOM killer goes insane: NFS rpc.statd, cron, everything that happens not to be at -17 gets wiped out. As a result the node is down.
I have Debian 6.0 (Linux 2.6.32-3-amd64).
Does anyone know where this -17 oom_adj assignment behaviour can be controlled?
Could launching sshd and Torque mom from /etc/rc.local be causing the overprotective behaviour?
The oom_adj value gets inherited from the process that spawned it. If sshd is set to -17, then the Bash shell it spawns will be too, and if you restart a service from that shell, the value propagates even further.
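A quick way to see the inheritance in action (run as root; on 2.6.32 the tunable is /proc/<pid>/oom_adj):

echo -17 > /proc/$$/oom_adj        # mark the current shell as exempt
bash -c 'cat /proc/$$/oom_adj'     # the child shell reports -17 as well
echo 0 > /proc/$$/oom_adj          # restore the default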
Editing the init script to change the value at the end of the startup process should fix this.
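For example, a minimal sketch of that reset, assuming Torque's mom daemon is named pbs_mom and is started from /etc/rc.local; put it right after the start command so user jobs no longer inherit -17:

for pid in $(pgrep -x pbs_mom); do   # pbs_mom is an assumption; match whatever you start from rc.local
  echo 0 > /proc/$pid/oom_adj        # back to the default so its children stay killable
done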
On our clusters we disable overcommit with sysctl. You should set the overcommit ratio depending on how much memory and swap you have.
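A sketch of the relevant settings (the ratio below is only an example value; pick one that fits your RAM and swap), e.g. in /etc/sysctl.conf, applied with sysctl -p:

vm.overcommit_memory = 2    # strict accounting: refuse allocations beyond the commit limit
vm.overcommit_ratio = 80    # commit limit = swap + 80% of RAM; example value only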
Once overcommit is disabled, an allocation that would exceed the commit limit simply fails (malloc returns NULL) instead of triggering the OOM killer later. It solved all our memory crashes on the cluster nodes.