A few days ago my Linux Apache server ran out of memory. The server is a Xen guest. It killed all of my processes except SSH (even cron, monit, and syslog).
A minute before it ran out of memory, a script of mine saved the output of various commands. Here is the output:
# free -m
             total       used       free     shared    buffers     cached
Mem:          2003       1866        137          0          3       1159
-/+ buffers/cache:        703       1300
Swap:          511          0        511
# vmstat 1 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa
 2  1    144 140944   4192 1186964    0    0     0     0     1     0  1  0 98  1
 3  1    144 142816   3704 1186896    0    0  4188 10132  3242  3242  0 19 59 22
# pidstat -dru 1 3
Linux 2.6.39-linode33 (web1) 09/10/11 _i686_
01:07:15 PID %user %system %CPU CPU Command
01:07:16 271 0.00 10.78 10.78 0 kswapd0
01:07:16 9871 0.00 100.00 100.00 1 flush-0:17
01:07:16 13082 0.00 1.96 1.96 0 apache2
01:07:16 15100 0.98 5.88 6.86 0 rm
01:07:16 15250 0.00 0.98 0.98 0 pidstat
01:07:15 PID minflt/s majflt/s VSZ RSS %MEM Command
01:07:16 13082 1.96 0.98 62872 29720 1.45 apache2
01:07:16 15250 224.51 0.00 1704 636 0.03 pidstat
01:07:15 PID kB_rd/s kB_wr/s kB_ccwr/s Command
01:07:16 13082 78.43 7.84 0.00 apache2
01:07:16 15100 1317.65 0.00 0.00 rm
01:07:16 15250 3.92 0.00 0.00 pidstat
01:07:16 PID %user %system %CPU CPU Command
01:07:17 3 0.00 0.99 0.99 0 ksoftirqd/0
01:07:17 12 0.00 0.99 0.99 2 ksoftirqd/2
01:07:17 271 0.00 1.98 1.98 2 kswapd0
01:07:17 9871 0.00 100.00 100.00 1 flush-0:17
01:07:17 15100 0.00 17.82 17.82 2 rm
01:07:17 15232 17.82 0.99 18.81 0 apache2
01:07:17 15250 0.99 1.98 2.97 0 pidstat
01:07:17 19944 0.00 0.99 0.99 0 kworker/0:2
01:07:16 PID minflt/s majflt/s VSZ RSS %MEM Command
01:07:17 15232 298.02 0.00 63096 12820 0.62 apache2
01:07:17 15250 231.68 0.00 1704 664 0.03 pidstat
01:07:16 PID kB_rd/s kB_wr/s kB_ccwr/s Command
01:07:17 15100 4261.39 0.00 0.00 rm
01:07:17 15232 3.96 7.92 0.00 apache2
01:07:17 15250 11.88 0.00 0.00 pidstat
01:07:17 PID %user %system %CPU CPU Command
01:07:18 271 0.00 11.11 11.11 2 kswapd0
01:07:18 9871 0.00 100.00 100.00 1 flush-0:17
01:07:18 15100 1.01 27.27 28.28 2 rm
01:07:18 15250 0.00 1.01 1.01 0 pidstat
01:07:18 19944 0.00 1.01 1.01 0 kworker/0:2
01:07:17 PID minflt/s majflt/s VSZ RSS %MEM Command
01:07:18 15250 229.29 0.00 1704 664 0.03 pidstat
01:07:17 PID kB_rd/s kB_wr/s kB_ccwr/s Command
01:07:18 15100 4230.30 0.00 0.00 rm
01:07:18 15250 16.16 4.04 0.00 pidstat
Average: PID %user %system %CPU CPU Command
Average: 3 0.00 0.33 0.33 - ksoftirqd/0
Average: 12 0.00 0.33 0.33 - ksoftirqd/2
Average: 271 0.00 7.95 7.95 - kswapd0
Average: 9871 0.00 100.00 100.00 - flush-0:17
Average: 13082 0.00 0.66 0.66 - apache2
Average: 15100 0.66 16.89 17.55 - rm
Average: 15232 5.96 0.33 6.29 - apache2
Average: 15250 0.33 1.32 1.66 - pidstat
Average: 19944 0.00 0.66 0.66 - kworker/0:2
Average: PID minflt/s majflt/s VSZ RSS %MEM Command
Average: 13082 0.66 0.33 62872 29720 1.45 apache2
Average: 15232 99.67 0.00 62927 12456 0.61 apache2
Average: 15250 228.48 0.00 1704 655 0.03 pidstat
Average: PID kB_rd/s kB_wr/s kB_ccwr/s Command
Average: 13082 26.49 2.65 0.00 apache2
Average: 15100 3256.95 0.00 0.00 rm
Average: 15232 1.32 2.65 0.00 apache2
Average: 15250 10.60 1.32 0.00 pidstat
Monit reported these stats (also just one minute before it ran out of memory):
System 'local'
status Resource limit matched
monitoring status monitored
load average [3.57] [2.63] [1.30]
cpu 0.5%us 18.5%sy 37.1%wa
memory usage 726464 kB [35.4%]
swap usage 148 kB [0.0%]
data collected Sat Sep 10 01:06:06 2011
Here is an excerpt of what was logged in kern.log:
Sep 10 01:07:31 web1 kernel: Out of memory: Kill process 2190 (portmap) score 1 or sacrifice child
Sep 10 01:07:31 web1 kernel: Killed process 2190 (portmap) total-vm:1820kB, anon-rss:100kB, file-rss:360kB
Sep 10 01:07:31 web1 kernel: apache2 invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Sep 10 01:07:31 web1 kernel: Pid: 31143, comm: apache2 Tainted: G W 2.6.39-linode33 #5
Sep 10 01:07:31 web1 kernel: Call Trace:
Sep 10 01:07:31 web1 kernel: [<c0185c10>] ? T.663+0x80/0xd0
Sep 10 01:07:31 web1 kernel: [<c0185cbe>] ? T.661+0x5e/0x160
Sep 10 01:07:31 web1 kernel: [<c013d8ac>] ? has_capability_noaudit+0xc/0x20
Sep 10 01:07:31 web1 kernel: [<c0185a1c>] ? oom_badness+0xdc/0x130
Sep 10 01:07:31 web1 kernel: [<c0185f76>] ? out_of_memory+0x1b6/0x2f0
Sep 10 01:07:31 web1 kernel: [<c0189715>] ? __alloc_pages_nodemask+0x6b5/0x6d0
Sep 10 01:07:31 web1 kernel: [<c0131c35>] ? dup_task_struct+0x55/0x140
Sep 10 01:07:31 web1 kernel: [<c0132728>] ? copy_process+0x88/0xb80
Sep 10 01:07:31 web1 kernel: [<c013326f>] ? do_fork+0x4f/0x290
Sep 10 01:07:31 web1 kernel: [<c0156084>] ? getnstimeofday+0x44/0x110
Sep 10 01:07:31 web1 kernel: [<c010f510>] ? sys_clone+0x30/0x40
Sep 10 01:07:31 web1 kernel: [<c06900cd>] ? ptregs_clone+0x15/0x48
Sep 10 01:07:31 web1 kernel: [<c068f7b1>] ? syscall_call+0x7/0xb
Sep 10 01:07:31 web1 kernel: [<c0680000>] ? sctp_icmp_proto_unreachable+0x20/0xc0
Sep 10 01:07:31 web1 kernel: Mem-Info:
Sep 10 01:07:31 web1 kernel: DMA per-cpu:
Sep 10 01:07:31 web1 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Sep 10 01:07:31 web1 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Sep 10 01:07:31 web1 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Sep 10 01:07:31 web1 kernel: CPU 3: hi: 0, btch: 1 usd: 0
Sep 10 01:07:31 web1 kernel: Normal per-cpu:
Sep 10 01:07:31 web1 kernel: CPU 0: hi: 186, btch: 31 usd: 36
Sep 10 01:07:31 web1 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Sep 10 01:07:31 web1 kernel: CPU 2: hi: 186, btch: 31 usd: 18
Sep 10 01:07:31 web1 kernel: CPU 3: hi: 186, btch: 31 usd: 0
Sep 10 01:07:31 web1 kernel: HighMem per-cpu:
Sep 10 01:07:31 web1 kernel: CPU 0: hi: 186, btch: 31 usd: 54
Sep 10 01:07:31 web1 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Sep 10 01:07:31 web1 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Sep 10 01:07:31 web1 kernel: CPU 3: hi: 186, btch: 31 usd: 22
Sep 10 01:07:31 web1 kernel: active_anon:7174 inactive_anon:3515 isolated_anon:0
Sep 10 01:07:31 web1 kernel: active_file:156159 inactive_file:133601 isolated_file:0
Sep 10 01:07:31 web1 kernel: unevictable:0 dirty:12 writeback:0 unstable:0
Sep 10 01:07:31 web1 kernel: free:39375 slab_reclaimable:166268 slab_unreclaimable:3027
Sep 10 01:07:31 web1 kernel: mapped:1913 shmem:8277 pagetables:95 bounce:0
Sep 10 01:07:31 web1 kernel: DMA free:2908kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:292kB inactive_file:296kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:988kB slab_unreclaimable:132kB kernel_stack:72kB pagetables:20kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:11075 all_unreclaimable? yes
Sep 10 01:07:31 web1 kernel: lowmem_reserve[]: 0 702 2008 2008
Sep 10 01:07:31 web1 kernel: Normal free:5376kB min:4004kB low:5004kB high:6004kB active_anon:0kB inactive_anon:0kB active_file:2316kB inactive_file:2264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:719320kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:664084kB slab_unreclaimable:11976kB kernel_stack:512kB pagetables:360kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:84892 all_unreclaimable? yes
Sep 10 01:07:31 web1 kernel: lowmem_reserve[]: 0 0 10444 10444
Sep 10 01:07:31 web1 kernel: HighMem free:149216kB min:512kB low:2372kB high:4236kB active_anon:28696kB inactive_anon:14060kB active_file:622028kB inactive_file:531844kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1336932kB mlocked:0kB dirty:48kB writeback:0kB mapped:7652kB shmem:33108kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 10 01:07:31 web1 kernel: lowmem_reserve[]: 0 0 0 0
Sep 10 01:07:31 web1 kernel: DMA: 231*4kB 126*8kB 21*16kB 4*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2908kB
Sep 10 01:07:31 web1 kernel: Normal: 1310*4kB 18*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5384kB
Sep 10 01:07:31 web1 kernel: HighMem: 18118*4kB 7805*8kB 836*16kB 27*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 149216kB
Sep 10 01:07:31 web1 kernel: 298087 total pagecache pages
Sep 10 01:07:31 web1 kernel: 35 pages in swap cache
Sep 10 01:07:31 web1 kernel: Swap cache stats: add 2601, delete 2566, find 2926/3087
Sep 10 01:07:31 web1 kernel: Free swap = 524144kB
Sep 10 01:07:31 web1 kernel: Total swap = 524284kB
Sep 10 01:07:31 web1 kernel: 1050608 pages RAM
Sep 10 01:07:31 web1 kernel: 865282 pages HighMem
Sep 10 01:07:31 web1 kernel: 537706 pages reserved
Sep 10 01:07:31 web1 kernel: 7195 pages shared
Sep 10 01:07:31 web1 kernel: 468537 pages non-shared
Sep 10 01:07:31 web1 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Sep 10 01:07:31 web1 kernel: [ 1059] 0 1059 552 184 3 -17 -1000 udevd
Sep 10 01:07:31 web1 kernel: [ 2196] 104 2196 470 175 0 0 0 rpc.statd
Sep 10 01:07:31 web1 kernel: [ 2323] 0 2323 1044 575 2 0 0 syslog-ng
Sep 10 01:07:31 web1 kernel: [ 2334] 0 2334 1332 252 3 -17 -1000 sshd
Sep 10 01:07:31 web1 kernel: [ 2360] 101 2360 1047 303 3 0 0 ntpd
Sep 10 01:07:31 web1 kernel: [ 2426] 0 2426 425 125 0 0 0 getty
Sep 10 01:07:31 web1 kernel: [32223] 0 32223 523 220 0 0 0 cron
Sep 10 01:07:31 web1 kernel: [ 8378] 0 8378 3478 566 0 0 0 monit
Sep 10 01:07:31 web1 kernel: [31143] 0 31143 15446 2689 0 0 0 apache2
Sep 10 01:07:31 web1 kernel: [15030] 0 15030 606 228 2 0 0 cron
Sep 10 01:07:31 web1 kernel: [15034] 0 15034 622 248 3 0 0 sh
Sep 10 01:07:31 web1 kernel: [15100] 0 15100 429 58 2 0 0 rm
Sep 10 01:07:31 web1 kernel: Out of memory: Kill process 2196 (rpc.statd) score 1 or sacrifice child
Sep 10 01:07:31 web1 kernel: Killed process 2196 (rpc.statd) total-vm:1880kB, anon-rss:124kB, file-rss:576kB
Also, since it was 1 AM, there were almost no Apache connections at the time.
As you can see, swap wasn't being used at all, and there was still quite a bit of free RAM (according to Monit).
Any ideas how I can figure out the cause?
Also, what is the "flush-0:17" process that was using so much CPU, according to pidstat?
From the log output, Apache is trying to make a very large memory allocation. To figure out why, you'll probably need to look at your Apache setup and see what could do that (are you using mod_perl, mod_python, etc.?). If you can't find it that way, you can put a proxy like nginx in front of Apache, and then nginx will log which request failed. If you want, you can run nginx on the same host and just use the limits command to restrict Apache so it gets killed before the OOM killer is invoked (letting nginx continue, or urlsnarf, etc.).
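A minimal sketch of the limits approach, assuming a Debian-style layout where the init script sources /etc/default/apache2 (check your distro); the 1 GB figure is only illustrative:
# in /etc/default/apache2 (or whatever file your Apache init script sources)
ulimit -v 1048576    # cap each Apache process at ~1 GB of virtual memory, so malloc fails before the OOM killer fires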
The flush PID is the kernel thread that handles syncing dirty VFS cache buffers to disk. It is most likely related to the rm. You can verify this by looking up which device the flush thread is working for and checking that it is the same device rm was working on:
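For example (a sketch; the path is a placeholder for whatever rm was deleting): the flush thread's name encodes the backing device's major:minor numbers, so flush-0:17 is the writeback thread for device 0:17.
grep -w '0:17' /proc/self/mountinfo    # field 3 is major:minor; the matching line shows the mount point
df /path/rm/was/deleting/from          # placeholder path; shows which filesystem it lives on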
Your system has run out of low memory (RAM which is directly mapped into the kernel virtual memory space). In a 32-bit Linux system the kernel can use at most about 700-800 MB of physical RAM for its internal data (and adding more RAM actually makes the situation worse, because more low memory is needed for the data structures used by the memory manager). The rest of RAM goes into the “high memory” zone, which can be used by userspace processes and the page cache, but is not usable for the kernel itself.
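On a running 32-bit system you can watch the low/high split directly; a quick sketch:
free -lm    # the Low: and High: rows show how much of each zone is in use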
Excerpt from your logs:
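DMA free:2908kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:292kB inactive_file:296kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:988kB slab_unreclaimable:132kB kernel_stack:72kB pagetables:20kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:11075 all_unreclaimable? yes
Normal free:5376kB min:4004kB low:5004kB high:6004kB active_anon:0kB inactive_anon:0kB active_file:2316kB inactive_file:2264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:719320kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:664084kB slab_unreclaimable:11976kB kernel_stack:512kB pagetables:360kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:84892 all_unreclaimable? yes
HighMem free:149216kB min:512kB low:2372kB high:4236kB active_anon:28696kB inactive_anon:14060kB active_file:622028kB inactive_file:531844kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1336932kB mlocked:0kB dirty:48kB writeback:0kB mapped:7652kB shmem:33108kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no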
The DMA zone is the low 16 MB needed to work with obsolete ISA devices; these days it can mostly be ignored. Low memory is the DMA and Normal zones together; only this memory can be used to hold kernel data. The rest of memory goes to the HighMem zone.
Note that there was plenty of free memory in the HighMem zone, but the system still ran out of memory, because the memory was needed for some structures used by the kernel, and HighMem is not suitable for this. Most of the Normal zone was occupied by the slab cache (slab_reclaimable and slab_unreclaimable), which is another kernel memory allocator, used to allocate blocks smaller than a page. Unfortunately, it is impossible to find out from your logs which slab allocations took up so much memory (current slab cache usage statistics can be read from /proc/slabinfo; maybe you will want to set up some monitoring of this data). The real problem might be a memory leak bug in the kernel; monitoring /proc/slabinfo should help to find the reason if the problem happens again.
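A minimal sketch of such monitoring (slabtop comes with procps; the log path and cron interval are just suggestions):
slabtop -o -s c | head -20                             # one-shot listing, sorted by cache size
( date; cat /proc/slabinfo ) >> /var/log/slabinfo.log  # e.g. run from cron every few minutes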
From my point of view, the OOM killer does not always kill the process that is actually at fault. However, if you are running Apache (prefork?), it is most commonly the one causing the memory usage.
I would be careful here. You cannot tell how many connections you had (or maybe you can? Sorry). Set up a tool like collectd, which can plot fine-grained graphs of the metrics that are most useful in this case: memory, the number of TCP connections (on the ports used by Apache, of course), and HTTP errors (especially 50x server errors).
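Until you have graphs, a rough substitute is to sample the established-connection count from cron; this sketch assumes Apache listens on port 80:
netstat -tn | awk '$4 ~ /:80$/ && $6 == "ESTABLISHED"' | wc -l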
I have sometimes been shocked by how much can change on a server in 20 seconds.
What values do you have for vm.overcommit_memory and vm.overcommit_ratio? (These may be too low, causing processes to be killed far too early.) What is your average Apache process memory usage?
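A quick way to check both, assuming the Apache processes are named apache2:
sysctl vm.overcommit_memory vm.overcommit_ratio
ps -C apache2 -o rss= | awk '{ sum += $1; n++ } END { if (n) printf "%.0f kB average over %d processes\n", sum/n, n }'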
What was the 'rm' doing, anyway? Did it possibly clear some cache that your scripts then missed and started rebuilding? Was some search engine unluckily spidering the site at that same time? Maybe a cache-miss storm?