I have two servers running in a private vSphere cloud, both running JBoss and Tomcat.
- Machine 8 - RHEL5.3, 3 gigs of 'physical' memory, 1 gig of swap
- Machine 25 - RHEL4.6, 2 gigs of 'physical' memory, 1 gig of swap
Every so often, machine 8 will become unresponsive, with the OOM killer effectively taking over the machine. Rebooting through the vSphere admin console has been the only option.
We have always assumed that the cause was the Java applications' memory limits (-Xmx etc.) being set too high. Therefore, after the latest restart, we took the opportunity to reduce the memory limits on the JVMs and to set up some scripts that log certain information.
This time around, the problem seems to have happened on both machines, though the logging specific to this problem was only set up on machine 8.
What is interesting is that swap usage doubles within a minute, while the Java applications' own usage does not. Sadly, our logging concentrated on the JVMs, so we don't know what was actually requesting all that memory.
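For reference, a system-wide sampler along these lines would have caught non-JVM allocations too (a rough sketch in Python; the 30-second interval, log path and top-5 cut-off are arbitrary placeholders, not what we actually ran):

```python
#!/usr/bin/env python
# Sketch of a system-wide memory sampler: records overall memory/swap from
# /proc/meminfo plus the largest resident processes, so non-JVM consumers
# show up as well. Interval, log path and top-5 cut-off are placeholders.
import os
import time

def meminfo():
    """Return /proc/meminfo as a dict of kB values."""
    info = {}
    for line in open('/proc/meminfo'):
        name, rest = line.split(':', 1)
        info[name] = int(rest.split()[0])
    return info

def top_rss(n=5):
    """Return the n processes with the largest resident set as (rss_kb, pid, name)."""
    procs = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        name, rss = '?', 0
        try:
            for line in open('/proc/%s/status' % pid):
                if line.startswith('Name:'):
                    name = line.split()[1]
                elif line.startswith('VmRSS:'):
                    rss = int(line.split()[1])
        except IOError:
            continue  # process exited while we were reading it
        procs.append((rss, pid, name))
    return sorted(procs, reverse=True)[:n]

while True:
    m = meminfo()
    record = '%s swap_used=%dkB mem_free=%dkB top_rss=%s\n' % (
        time.strftime('%Y-%m-%d %H:%M:%S'),
        m['SwapTotal'] - m['SwapFree'],
        m['MemFree'],
        top_rss())
    open('/var/log/mem-sampler.log', 'a').write(record)
    time.sleep(30)
```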
Here is the logging of memory usage up until the machine stopped responding (reconstituted from the top output logged for the various JVMs):
Time      Load Avg  Phys Used (kB)  Virt Used (kB)
00:19:23  1.01      3016868         380872
00:20:27  3.44      3025136         435216
00:20:32  3.24      3029548         475548
00:21:37  3.51      3023888         864404
00:21:43  3.39      3030808         889608
So the virtual memory use has gone up from 380 megs to 889 megs in under 2.5 minutes.
I am aware of this problem, but I don't really know whether it is the same issue: the Java usage does not seem unreasonable on our machine, and the machine that suffers from this the most is on RHEL5.3.
We have not activated the vm.lower_zone_protection option as suggested in the linked question.
Does anyone have any suggestions or explanations?
Also, is the fact that machine 25 went down as well just a coincidence, or could something within vSphere cause them both to react this way?
It's the JVM that's causing the problems. Basically, the JVM "pre-allocates" all the memory it wants up front (its full -Xmx heap, for a start), but the kernel only actually "gives" it that memory when it is really touched. So you can get into the situation you describe very quickly (swap usage shoots up, the OOM killer goes on a rampage) without anything "obviously" using more memory, because, in principle, the memory is already "in use".
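You can reproduce the same effect outside the JVM with a few lines of Python (a throwaway sketch; the 512 MiB size is arbitrary): mapping a large anonymous region bumps the process's virtual size immediately, but the resident size only climbs as the pages are actually touched, which is exactly when the kernel has to find real memory or swap for them.

```python
# Minimal demonstration of "reserved but not yet backed" memory, much like a
# JVM reserving its full -Xmx heap up front. The 512 MiB size is arbitrary;
# shrink it if the test box is tight on memory.
import mmap

LENGTH = 512 * 1024 * 1024  # 512 MiB

def vm_sizes():
    """Return (VmSize, VmRSS) in kB for this process, from /proc/self/status."""
    size = rss = 0
    for line in open('/proc/self/status'):
        if line.startswith('VmSize:'):
            size = int(line.split()[1])
        elif line.startswith('VmRSS:'):
            rss = int(line.split()[1])
    return size, rss

print('before map      VmSize=%d kB  VmRSS=%d kB' % vm_sizes())

# Reserve the region: virtual size jumps immediately, resident size does not.
region = mmap.mmap(-1, LENGTH)
print('after map       VmSize=%d kB  VmRSS=%d kB' % vm_sizes())

# Touch every page: only now does the kernel actually hand over real memory,
# so resident size (and, under pressure, swap usage) climbs accordingly.
for offset in range(0, LENGTH, mmap.PAGESIZE):
    region[offset] = 1
print('after touching  VmSize=%d kB  VmRSS=%d kB' % vm_sizes())
```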
Solutions include tuning the JVM to use less memory, turning off overcommit (not a good idea, for reasons I've explained before), providing more swap (the machine will slow to a crawl rather than die, giving you a chance to get in and examine the problem "live"), or simply giving the VMs more memory. It's cheap enough.
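Before touching any of those knobs, it's worth seeing how far the box already is into its commitments; the kernel exposes the relevant numbers, and a read-only sketch like this (nothing here changes any settings) prints them:

```python
# Read-only look at how much memory the kernel has already promised to
# processes versus what the box can actually cover. No settings are changed.

def meminfo():
    info = {}
    for line in open('/proc/meminfo'):
        name, rest = line.split(':', 1)
        info[name] = int(rest.split()[0])  # values in kB
    return info

m = meminfo()
mode = open('/proc/sys/vm/overcommit_memory').read().strip()

print('vm.overcommit_memory = %s  (0 = heuristic, 1 = always, 2 = never)' % mode)
print('Committed_AS = %8d kB  (memory already promised to processes)' % m['Committed_AS'])
print('CommitLimit  = %8d kB  (only enforced when overcommit is off)' % m['CommitLimit'])
print('RAM + swap   = %8d kB' % (m['MemTotal'] + m['SwapTotal']))
```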