We have a server (an ESXi virtual machine) at work that freezes from time to time with "kernel panic: Out of memory and no killable processes..."
The host has 12 GB of memory.
Configuration of the virtual machine:
- VMware ESXi, VM version 7
- 2 CPUs
- 8192 MB memory (memory reservation 0, memory limit unlimited)
- SuSE 11.3 (64-bit), kernel 2.6.34-12
- Firebird, PostgreSQL, DB2
- PHP 5.3, PHP-FPM, lighttpd, memcached, OpenOffice.org
The machine is NOT heavily used, yet it crashes once a day or once every two days. Sometimes it happens over the weekend.
How can I find out what is causing the server to crash?
Extract from the vmware.log file:
Apr 03 07:21:22.266: vcpu-0| Vix: [17514025 vmxCommands.c:7612]: VMAutomation_HandleCLIHLTEvent. Do nothing.
Apr 03 07:21:22.266: vcpu-0| Msg_Hint: msg.monitorevent.halt (sent)
Apr 03 07:21:22.266: vcpu-0| The CPU has been disabled by the guest operating system. You will need to power off or reset the virtual machine at this point.
Apr 03 07:21:22.266: vcpu-0| ---------------------------------------
Apr 03 07:21:37.167: vmx| GuestRpcSendTimedOut: message to toolbox timed out.
Apr 03 07:21:37.167: vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
Apr 03 22:30:06.017: mks| MKS: Base polling period is 10000us
UPDATE I (excerpt from /var/log/messages)
Extract from /var/log/messages where it all (probably) starts. I am going to remove /opt/eduserver/bin/php
from cron, and we will see whether the crash happens again.
Apr 9 22:15:02 testing /usr/sbin/cron[4312]: (root) CMD (/opt/eduserver/bin/php /srv/www/htdocs/imacs/radek/trunk/lib/views/edu_scheduler/controllers/action_scheduler.php >/var/lib/edumate/imacs/radek/trunk/scheduler )
Apr 9 22:15:20 testing kernel: [115148.493482] oom_kill_process: 3 callbacks suppressed
Apr 9 22:15:20 testing kernel: [115148.493485] php invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:15:20 testing kernel: [115148.493488] Pid: 4317, comm: php Not tainted 2.6.34-12-desktop #1
Apr 9 22:15:20 testing kernel: [115148.493490] Call Trace:
Apr 9 22:15:20 testing kernel: [115148.493511] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:15:20 testing kernel: [115148.493516] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:15:20 testing kernel: [115148.493522] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:15:20 testing kernel: [115148.493525] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:15:20 testing kernel: [115148.493529] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:15:20 testing kernel: [115148.493533] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:15:20 testing kernel: [115148.493536] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:15:20 testing kernel: [115148.493541] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:15:20 testing kernel: [115148.493545] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:15:20 testing kernel: [115148.493548] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:15:20 testing kernel: [115148.493553] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:15:20 testing kernel: [115148.493557] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:15:20 testing kernel: [115148.493561] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:15:20 testing kernel: [115148.493565] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:15:20 testing kernel: [115148.493587] [<00007f52b7d4cce5>] 0x7f52b7d4cce5
Apr 9 22:15:20 testing kernel: [115148.493588] Mem-Info:
Apr 9 22:15:20 testing kernel: [115148.493590] Node 0 DMA per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493592] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493593] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493595] Node 0 DMA32 per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493597] CPU 0: hi: 186, btch: 31 usd: 155
Apr 9 22:15:20 testing kernel: [115148.493598] CPU 1: hi: 186, btch: 31 usd: 161
Apr 9 22:15:20 testing kernel: [115148.493600] Node 0 Normal per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493601] CPU 0: hi: 186, btch: 31 usd: 173
Apr 9 22:15:20 testing kernel: [115148.493603] CPU 1: hi: 186, btch: 31 usd: 57
Apr 9 22:15:20 testing kernel: [115148.493607] active_anon:1465647 inactive_anon:288016 isolated_anon:0
Apr 9 22:15:20 testing kernel: [115148.493607] active_file:129 inactive_file:784 isolated_file:0
Apr 9 22:15:20 testing kernel: [115148.493608] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:15:20 testing kernel: [115148.493609] free:11853 slab_reclaimable:4721 slab_unreclaimable:64985
Apr 9 22:15:20 testing kernel: [115148.493609] mapped:14998 shmem:15500 pagetables:161144 bounce:0
Apr 9 22:15:20 testing kernel: [115148.493611] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493618] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:15:20 testing kernel: [115148.493621] Node 0 DMA32 free:24432kB min:4272kB low:5340kB high:6408kB active_anon:2097640kB inactive_anon:524448kB active_file:52kB inactive_file:64kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:448kB shmem:360kB slab_reclaimable:1988kB slab_unreclaimable:97472kB kernel_stack:17712kB pagetables:239608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:144 all_unreclaimable? no
Apr 9 22:15:20 testing kernel: [115148.493629] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:15:20 testing kernel: [115148.493631] Node 0 Normal free:7168kB min:7192kB low:8988kB high:10788kB active_anon:3764948kB inactive_anon:627616kB active_file:464kB inactive_file:3072kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59544kB shmem:61640kB slab_reclaimable:16896kB slab_unreclaimable:162468kB kernel_stack:28984kB pagetables:404968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1440 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493639] lowmem_reserve[]: 0 0 0 0
Apr 9 22:15:20 testing kernel: [115148.493641] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:15:20 testing kernel: [115148.493648] Node 0 DMA32: 272*4kB 140*8kB 31*16kB 127*32kB 84*64kB 42*128kB 11*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 24432kB
Apr 9 22:15:20 testing kernel: [115148.493655] Node 0 Normal: 840*4kB 26*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7168kB
Apr 9 22:15:20 testing kernel: [115148.493662] 19767 total pagecache pages
Apr 9 22:15:20 testing kernel: [115148.493663] 3345 pages in swap cache
Apr 9 22:15:20 testing kernel: [115148.493664] Swap cache stats: add 531666, delete 528321, find 103411/104065
Apr 9 22:15:20 testing kernel: [115148.493666] Free swap = 0kB
Apr 9 22:15:20 testing kernel: [115148.493667] Total swap = 2103292kB
Apr 9 22:15:20 testing kernel: [115148.514162] 2097136 pages RAM
Apr 9 22:15:20 testing kernel: [115148.514164] 48271 pages reserved
Apr 9 22:15:20 testing kernel: [115148.514165] 106772 pages shared
Apr 9 22:15:20 testing kernel: [115148.514166] 2006923 pages non-shared
Apr 9 22:15:20 testing kernel: [115148.514169] Out of memory: kill process 3016 (cron) score 308233 or a child
Apr 9 22:15:20 testing kernel: [115148.514171] Killed process 15546 (cron) vsz:50064kB, anon-rss:272kB, file-rss:32kB
Apr 9 22:16:01 testing /usr/sbin/cron[4347]: (root) CMD (/usr/bin/ruby /root/radek/scripts/freemem.rb)
Apr 9 22:17:07 testing kernel: [115255.428734] vmtoolsd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:07 testing kernel: [115255.428738] Pid: 2772, comm: vmtoolsd Not tainted 2.6.34-12-desktop #1
Apr 9 22:17:08 testing kernel: [115255.428740] Call Trace:
Apr 9 22:17:08 testing kernel: [115255.428751] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:17:08 testing kernel: [115255.428756] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:17:08 testing kernel: [115255.428762] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:17:08 testing kernel: [115255.428765] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:17:08 testing kernel: [115255.428769] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:17:08 testing kernel: [115255.428773] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:17:08 testing kernel: [115255.428777] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:17:08 testing kernel: [115255.428781] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:17:08 testing kernel: [115255.428785] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:17:08 testing kernel: [115255.428788] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:17:08 testing kernel: [115255.428793] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:17:08 testing kernel: [115255.428802] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:17:08 testing kernel: [115255.428824] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:17:08 testing kernel: [115255.428828] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:17:08 testing kernel: [115255.428841] [<00007f09951973c0>] 0x7f09951973c0
Apr 9 22:17:08 testing kernel: [115255.428842] Mem-Info:
Apr 9 22:17:08 testing kernel: [115255.428844] Node 0 DMA per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428846] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428847] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428848] Node 0 DMA32 per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428850] CPU 0: hi: 186, btch: 31 usd: 44
Apr 9 22:17:08 testing kernel: [115255.428852] CPU 1: hi: 186, btch: 31 usd: 174
Apr 9 22:17:08 testing kernel: [115255.428853] Node 0 Normal per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428855] CPU 0: hi: 186, btch: 31 usd: 146
Apr 9 22:17:08 testing kernel: [115255.428856] CPU 1: hi: 186, btch: 31 usd: 171
Apr 9 22:17:08 testing kernel: [115255.428860] active_anon:1464570 inactive_anon:287629 isolated_anon:0
Apr 9 22:17:08 testing kernel: [115255.428861] active_file:66 inactive_file:2047 isolated_file:64
Apr 9 22:17:08 testing kernel: [115255.428862] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:17:08 testing kernel: [115255.428862] free:11882 slab_reclaimable:4727 slab_unreclaimable:64987
Apr 9 22:17:08 testing kernel: [115255.428863] mapped:15715 shmem:15500 pagetables:161192 bounce:0
Apr 9 22:17:08 testing kernel: [115255.428865] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428872] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:17:08 testing kernel: [115255.428875] Node 0 DMA32 free:24448kB min:4272kB low:5340kB high:6408kB active_anon:2091648kB inactive_anon:522644kB active_file:176kB inactive_file:7944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:3496kB shmem:360kB slab_reclaimable:2004kB slab_unreclaimable:97488kB kernel_stack:17712kB pagetables:239656kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:210 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428882] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:17:08 testing kernel: [115255.428885] Node 0 Normal free:7268kB min:7192kB low:8988kB high:10788kB active_anon:3766632kB inactive_anon:627872kB active_file:88kB inactive_file:244kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59364kB shmem:61640kB slab_reclaimable:16904kB slab_unreclaimable:162460kB kernel_stack:29000kB pagetables:405112kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428893] lowmem_reserve[]: 0 0 0 0
Apr 9 22:17:08 testing kernel: [115255.428895] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:17:08 testing kernel: [115255.428902] Node 0 DMA32: 278*4kB 127*8kB 33*16kB 119*32kB 81*64kB 44*128kB 6*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 24448kB
Apr 9 22:17:08 testing kernel: [115255.428909] Node 0 Normal: 881*4kB 20*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7268kB
Apr 9 22:17:08 testing kernel: [115255.428915] 18755 total pagecache pages
Apr 9 22:17:08 testing kernel: [115255.428916] 1043 pages in swap cache
Apr 9 22:17:08 testing kernel: [115255.428918] Swap cache stats: add 531680, delete 530637, find 103628/104282
Apr 9 22:17:08 testing kernel: [115255.428919] Free swap = 0kB
Apr 9 22:17:08 testing kernel: [115255.428920] Total swap = 2103292kB
Apr 9 22:17:08 testing kernel: [115255.447686] 2097136 pages RAM
Apr 9 22:17:08 testing kernel: [115255.447688] 48271 pages reserved
Apr 9 22:17:08 testing kernel: [115255.447689] 64969 pages shared
Apr 9 22:17:08 testing kernel: [115255.447690] 2006202 pages non-shared
Apr 9 22:17:08 testing kernel: [115255.447693] Out of memory: kill process 3016 (cron) score 308364 or a child
Apr 9 22:17:08 testing kernel: [115255.447696] Killed process 15547 (cron) vsz:50064kB, anon-rss:316kB, file-rss:4kB
Apr 9 22:17:08 testing kernel: [115255.753860] db2sysc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:08 testing kernel: [115255.753864] Pid: 3346, comm: db2sysc Not tainted 2.6.34-12-desktop #1
You must find the culprit that is using too much memory. You can do that with a simple script that records the output of
ps
from time to time, or by using a monitoring facility like Munin. Without watching exactly what is going on, it's not easy to know what is eating your memory and swap to the point of leaving none available, though I am inclined to suspect the databases first.
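As a starting point, something like this minimal snapshot logger could be run from cron every few minutes (a sketch, not a finished tool; the log path and the 15-process cutoff are my own choices):

```shell
#!/bin/sh
# Append a timestamped snapshot of the top memory consumers, so the
# last entries before an OOM kill show what was growing.
# LOG path is an assumption -- adjust to taste.
LOG="${LOG:-/tmp/memwatch.log}"
{
    date '+%F %T'
    # Top 15 processes by resident set size (procps ps).
    ps -eo pid,vsz,rss,comm --sort=-rss | head -n 15
    # Overall memory and swap picture, if free(1) is available.
    command -v free >/dev/null && free -m
    echo '---'
} >> "$LOG"
```

Add it to root's crontab (e.g. `*/5 * * * * /usr/local/bin/memwatch.sh`) and, after the next crash, read the tail of the log to see which processes were ballooning.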
How much memory is assigned to the SuSE instance? Given that you're running a lot of memory-hungry services on it (three RDBMSs plus memcached), it's going to need a significant share of the 8 GB of memory to run.
You'll need to check both the memory reservation and the limit setting in ESXi for the SuSE instance; remember that a limit set too low can force the machine to swap heavily or even crash.
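Note that your log shows "Free swap = 0kB", i.e. swap was completely exhausted before the OOM killer fired. To see which processes are holding the swap inside the guest, you can read the VmSwap field from /proc (available since kernel 2.6.34, which matches the 2.6.34-12 kernel above). A rough sketch:

```shell
#!/bin/sh
# Sketch: list the top 10 per-process swap consumers by reading
# VmSwap from /proc/<pid>/status (kernel >= 2.6.34).
swap_top() {
    for d in /proc/[0-9]*; do
        kb=$(awk '/^VmSwap:/ {print $2}' "$d/status" 2>/dev/null)
        # Skip processes that vanished or use no swap.
        [ -n "$kb" ] && [ "$kb" -gt 0 ] && \
            printf '%8s kB  %s\n' "$kb" "$(cat "$d/comm" 2>/dev/null)"
    done | sort -rn | head -n 10
}
swap_top
```

Running this when memory pressure builds (before the crash) should point at the same culprit the OOM killer eventually trips over.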