On our cluster we sometimes had nodes go down when a new process requested too much memory. I was puzzled as to why the OOM killer does not simply kill the guilty process.
The reason turned out to be that some processes get an oom_adj of -17, which makes them off-limits to the OOM killer (unkillable!).
I can clearly see that with the following script:
#!/bin/bash
# list every process whose oom_adj is not the default 0 (skipping /proc/self)
for i in `grep -v '^0$' /proc/*/oom_adj | awk -F/ '{print $3}' | grep -v self`; do
  ps -p $i | grep -v CMD    # show the ps line without the header
done
OK, it makes sense for sshd, udevd, and dhclient, but then I see regular user processes getting -17 as well. Once such a user process causes an OOM event it will never be killed, and the OOM killer goes insane: NFS rpc.statd, cron, everything that happens not to be at -17 gets wiped out. As a result the node is down.
I have Debian 6.0 (Linux 2.6.32-3-amd64).
Does anyone know where this -17 oom_adj assignment behaviour can be controlled?
Could launching sshd and Torque mom from /etc/rc.local be causing the overprotective behaviour?
The oom_adj value gets inherited from the process that spawned it. If sshd is set to -17, then the Bash shell it spawns will be too, and if you restart a service from that shell, the value propagates even further.
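A quick way to see the inheritance in action (run as root; on 2.6.32 the tunable is /proc/<pid>/oom_adj):

echo -17 > /proc/$$/oom_adj        # mark the current shell as exempt
bash -c 'cat /proc/$$/oom_adj'     # the child shell reports -17 as well
echo 0 > /proc/$$/oom_adj          # restore the default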
Editing the init script to change the value at the end of the startup process should fix this.
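For example, a minimal sketch of that reset, assuming Torque's mom daemon is named pbs_mom and is started from /etc/rc.local; put it right after the start command so user jobs no longer inherit -17:

for pid in $(pgrep -x pbs_mom); do   # pbs_mom is an assumption; match whatever you start from rc.local
  echo 0 > /proc/$pid/oom_adj        # back to the default so its children stay killable
done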
On our clusters we disable overcommit with sysctl. You should set the overcommit ratio depending on how much memory and swap you have.
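A sketch of the relevant settings (the ratio below is only an example value; pick one that fits your RAM and swap), e.g. in /etc/sysctl.conf, applied with sysctl -p:

vm.overcommit_memory = 2    # strict accounting: refuse allocations beyond the commit limit
vm.overcommit_ratio = 80    # commit limit = swap + 80% of RAM; example value only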
Once overcommit is disabled, an allocation that would exceed the commit limit simply fails (malloc returns NULL) instead of triggering the OOM killer later. It solved all our memory crashes on the cluster nodes.