I'm running a CentOS 6 box (centos-release-6-0.el6.centos.5.x86_64, kernel 2.6.32-71.29.1.el6.x86_64) as a virtual machine with 32 GB RAM and 6 vCPUs. Java is the Java(TM) SE Runtime Environment (build 1.6.0_27-b07).
Every once in a while the OOM killer kills my JBoss, which is configured to use no more than 13 GB of RAM. The JBoss parameters are:
java -Dprogram.name=run.sh -server -Xms12288m -Xmx12288m -XX:MaxPermSize=1024m -Dorg.jboss.resolver.warning=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Dfile.encoding=UTF-8 -Djava.net.preferIPv4Stack=true -Djava.endorsed.dirs=/usr/java/jboss-as/lib/endorsed -classpath /usr/java/jboss-as/bin/run.jar org.jboss.Main -b 0.0.0.0 --configuration=default
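(The 13 GB figure is simply the sum of the two explicit limits above: 12288 MB of heap via -Xmx plus 1024 MB of PermGen via -XX:MaxPermSize, i.e. 13312 MB ≈ 13 GB.)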
When this happens, the following lines are written to /var/log/messages:
Feb 13 12:21:02 prod-app kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Feb 13 12:21:02 prod-app kernel: java cpuset=/ mems_allowed=0
Feb 13 12:21:02 prod-app kernel: Pid: 11903, comm: java Not tainted 2.6.32-71.29.1.el6.x86_64 #1
Feb 13 12:21:02 prod-app kernel: Call Trace:
Feb 13 12:21:02 prod-app kernel: [<ffffffff810c2e01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Feb 13 12:21:02 prod-app kernel: [<ffffffff8110f1bb>] oom_kill_process+0xcb/0x2e0
Feb 13 12:21:02 prod-app kernel: [<ffffffff8110f780>] ? select_bad_process+0xd0/0x110
Feb 13 12:21:02 prod-app kernel: [<ffffffff8110f818>] __out_of_memory+0x58/0xc0
Feb 13 12:21:02 prod-app kernel: [<ffffffff8110fa19>] out_of_memory+0x199/0x210
Feb 13 12:21:02 prod-app kernel: [<ffffffff8111ebe2>] __alloc_pages_nodemask+0x832/0x850
Feb 13 12:21:02 prod-app kernel: [<ffffffff81150cba>] alloc_pages_current+0x9a/0x100
Feb 13 12:21:02 prod-app kernel: [<ffffffff8110c617>] __page_cache_alloc+0x87/0x90
Feb 13 12:21:02 prod-app kernel: [<ffffffff8112136b>] __do_page_cache_readahead+0xdb/0x210
Feb 13 12:21:02 prod-app kernel: [<ffffffff811214c1>] ra_submit+0x21/0x30
Feb 13 12:21:02 prod-app kernel: [<ffffffff8110e1c1>] filemap_fault+0x4b1/0x510
Feb 13 12:21:02 prod-app kernel: [<ffffffff81135604>] __do_fault+0x54/0x500
Feb 13 12:21:02 prod-app kernel: [<ffffffff81135ba7>] handle_pte_fault+0xf7/0xad0
Feb 13 12:21:02 prod-app kernel: [<ffffffff8125f78c>] ? rb_erase+0x1bc/0x310
Feb 13 12:21:02 prod-app kernel: [<ffffffff81056720>] ? __dequeue_entity+0x30/0x50
Feb 13 12:21:02 prod-app kernel: [<ffffffff810117bc>] ? __switch_to+0x1ac/0x320
Feb 13 12:21:02 prod-app kernel: [<ffffffff81059e02>] ? finish_task_switch+0x42/0xd0
Feb 13 12:21:02 prod-app kernel: [<ffffffff8113676d>] handle_mm_fault+0x1ed/0x2b0
Feb 13 12:21:02 prod-app kernel: [<ffffffff814c92b6>] ? thread_return+0x4e/0x778
Feb 13 12:21:02 prod-app kernel: [<ffffffff814ce503>] do_page_fault+0x123/0x3a0
Feb 13 12:21:02 prod-app kernel: [<ffffffff814cbf75>] page_fault+0x25/0x30
Feb 13 12:21:02 prod-app kernel: Mem-Info:
Feb 13 12:21:02 prod-app kernel: Node 0 DMA per-cpu:
Feb 13 12:21:02 prod-app kernel: CPU 0: hi: 0, btch: 1 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 1: hi: 0, btch: 1 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 2: hi: 0, btch: 1 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 3: hi: 0, btch: 1 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 4: hi: 0, btch: 1 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 5: hi: 0, btch: 1 usd: 0
Feb 13 12:21:02 prod-app kernel: Node 0 DMA32 per-cpu:
Feb 13 12:21:02 prod-app kernel: CPU 0: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 1: hi: 186, btch: 31 usd: 30
Feb 13 12:21:02 prod-app kernel: CPU 2: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 3: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 4: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 5: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: Node 0 Normal per-cpu:
Feb 13 12:21:02 prod-app kernel: CPU 0: hi: 186, btch: 31 usd: 10
Feb 13 12:21:02 prod-app kernel: CPU 1: hi: 186, btch: 31 usd: 30
Feb 13 12:21:02 prod-app kernel: CPU 2: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 3: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: CPU 4: hi: 186, btch: 31 usd: 36
Feb 13 12:21:02 prod-app kernel: CPU 5: hi: 186, btch: 31 usd: 0
Feb 13 12:21:02 prod-app kernel: active_anon:7449508 inactive_anon:565931 isolated_anon:0
Feb 13 12:21:02 prod-app kernel: active_file:0 inactive_file:665 isolated_file:0
Feb 13 12:21:02 prod-app kernel: unevictable:0 dirty:2 writeback:0 unstable:0
Feb 13 12:21:02 prod-app kernel: free:49966 slab_reclaimable:2775 slab_unreclaimable:143965
Feb 13 12:21:02 prod-app kernel: mapped:70 shmem:0 pagetables:21396 bounce:0
Feb 13 12:21:02 prod-app kernel: Node 0 DMA free:15584kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15188kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb 13 12:21:02 prod-app kernel: lowmem_reserve[]: 0 3000 32290 32290
Feb 13 12:21:02 prod-app kernel: Node 0 DMA32 free:123316kB min:6276kB low:7844kB high:9412kB active_anon:1954808kB inactive_anon:508096kB active_file:0kB inactive_file:668kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:184kB slab_unreclaimable:112kB kernel_stack:0kB pagetables:3904kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:216 all_unreclaimable? yes
Feb 13 12:21:02 prod-app kernel: lowmem_reserve[]: 0 0 29290 29290
Feb 13 12:21:02 prod-app kernel: Node 0 Normal free:60964kB min:61276kB low:76592kB high:91912kB active_anon:27843224kB inactive_anon:1755628kB active_file:0kB inactive_file:1992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29992960kB mlocked:0kB dirty:8kB writeback:0kB mapped:332kB shmem:0kB slab_reclaimable:10916kB slab_unreclaimable:575748kB kernel_stack:2968kB pagetables:81680kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? yes
Feb 13 12:21:02 prod-app kernel: lowmem_reserve[]: 0 0 0 0
Feb 13 12:21:02 prod-app kernel: Node 0 DMA: 2*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15584kB
Feb 13 12:21:02 prod-app kernel: Node 0 DMA32: 95*4kB 74*8kB 39*16kB 25*32kB 8*64kB 8*128kB 3*256kB 4*512kB 2*1024kB 38*2048kB 9*4096kB = 123484kB
Feb 13 12:21:02 prod-app kernel: Node 0 Normal: 1234*4kB 798*8kB 561*16kB 411*32kB 203*64kB 93*128kB 9*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 60648kB
Feb 13 12:21:02 prod-app kernel: 10130 total pagecache pages
Feb 13 12:21:02 prod-app kernel: 9240 pages in swap cache
Feb 13 12:21:02 prod-app kernel: Swap cache stats: add 23810210, delete 23800970, find 5525856/6576436
Feb 13 12:21:02 prod-app kernel: Free swap = 0kB
Feb 13 12:21:02 prod-app kernel: Total swap = 8388600kB
Feb 13 12:21:02 prod-app kernel: 8388592 pages RAM
Feb 13 12:21:02 prod-app kernel: 134650 pages reserved
Feb 13 12:21:02 prod-app kernel: 423 pages shared
Feb 13 12:21:02 prod-app kernel: 8069227 pages non-shared
Feb 13 12:21:02 prod-app kernel: Out of memory: kill process 11666 (run.sh) score 21512214 or a child
Feb 13 12:21:02 prod-app kernel: Killed process 11696 (java) vsz:50991828kB, anon-rss:32016636kB, file-rss:400kB
JBoss seems to be taking far more RAM than it should, which suggests that we might have a memory leak in our application. The funny thing is: we have around 20 other installations of the software (also on CentOS boxes), and this behavior is unique to this machine.
I'm unsure how to debug this. There is no information logged in the JBoss server.log.
- Is there a chance to preempt the OOM Killer by adjusting the GC parameters?
- Is there any tool that could help me look into what is happening on the server when the situation becomes critical?
Thanks a lot for your help!
Be well
S.
This is a bit of a shot in the dark, but: You're using
-Xms12288m -Xmx12288m
which only limits the heap portion of your memory. Since your process grew to ~50 GB of memory usage, the excess could have been caused by huge stacks. So if your application spawns a lot of threads, relies heavily on recursion, or the like, that will be your cue.
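A quick sanity check (only a sketch; <jboss_pid> is a placeholder for the JVM's PID, and the ~1 MB default stack size per thread on 64-bit HotSpot is an assumption you should verify, e.g. with -XX:+PrintFlagsFinal if your build supports it):
ps -o nlwp= -p <jboss_pid>    # number of threads in the JVM process
jstack <jboss_pid> | grep -c 'java.lang.Thread.State'    # roughly the same count, taken from a thread dump
A few hundred threads at ~1 MB of stack each is negligible; tens of thousands would explain several GB of memory outside the heap.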
If that's the case, look into things like GC logging, jmap, jstack, pmap and jvisualvm to analyze the situation, and into the -Xss and -XX:MaxThreadStackSize JVM parameters and your thread pool sizes to remedy the problem. (I'm not entirely sure about the parameters; Java 5 was the last version I dealt with professionally.)
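A rough illustration of how those tools could be used (again, <jboss_pid> is a placeholder, the log path is arbitrary, and the GC-logging flags are the usual HotSpot ones for Java 6, so double-check them against your JVM):
jmap -heap <jboss_pid>              # heap and PermGen usage summary
jstack <jboss_pid> > threads.txt    # full thread dump; count and inspect the threads
pmap -x <jboss_pid> | tail -n 1     # total memory mapped by the process, as the OS sees it
For GC logging and smaller thread stacks, something like the following could be added to the java command line (in JBoss typically via JAVA_OPTS in bin/run.conf):
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/jboss-gc.log -Xss256k
If shrinking -Xss makes the problem go away, thread stacks were indeed the culprit.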