I've been trying to track down the cause of intermittent slowdowns of our Tomcat server. We get the slowdowns several times a week, where Tomcat will stop responding or take several minutes to process requests, and the cpu loads on the (Linux) box, as shown by uptime, spike from typically around 1-2 to over 30. Then things gradually work themselves out and everything goes back to normal after maybe 10 minutes or so.
We're using Apache as a front end and Postgres for our database. I've been digging through the logs to try and figure out what might be causing the problem. I haven't noticed any obvious increases in demand around the times of the slowdowns.
What I have found is that on multiple occasions, just prior to the slowdown, Tomcat appears to just go to sleep for about three and a half minutes. There are no entries in its logs during that time, and no queries from Tomcat to the database. After its little nap, Tomcat will wake up and furiously start trying to process all the stuff that has backed up in the meantime, leading to heavy database and cpu loading and slow response times.
In order to try and figure out what Tomcat is doing during its nap time, I set up a script to monitor its log and send a kill -3 signal to get a thread dump if there has been no activity in the log for three minutes. Unfortunately, the signal doesn't wake Tomcat, so the thread dump doesn't occur until after it has woken up of its own accord and has resumed processing.
Apache and Postgres are apparently still awake and active during the three and a half minute gaps -- their logs show non-Tomcat-related activity continuing during those periods.
Our Tomcat version is 5.0.28.
Thoughts, suggestions? I'm pretty new at working with Tomcat so please don't assume much knowledge.
After activating verbose garbage collection per Alex's suggestion, I captured a couple occurrences of the problem and found that a Full GC was responsible, taking over 200 seconds in both cases, E.g.:
04:21:55.648491500 [GC 1035796K->933637K(1041984K), 0.3407580 secs]
04:21:56.012832500 [Full GC[Unloading class sun.reflect.GeneratedMethodAccessor633]
04:22:38.003920500 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor39]
04:22:38.004051500 [Unloading class sun.reflect.GeneratedConstructorAccessor102]
04:22:38.004392500 [Unloading class sun.reflect.GeneratedConstructorAccessor98]
04:22:38.004533500 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor40]
04:22:38.004716500 [Unloading class sun.reflect.GeneratedMethodAccessor634]
04:22:38.004808500 [Unloading class sun.reflect.GeneratedConstructorAccessor90]
04:22:38.004889500 [Unloading class sun.reflect.GeneratedConstructorAccessor95]
04:22:38.005044500 [Unloading class sun.reflect.GeneratedMethodAccessor632]
04:25:18.688916500 933637K->154281K(1041984K), 202.6760940 secs]
Now I just need to figure out how to tune things to prevent that. (Suggestions welcome.)
Thanks for the help Alex and Mainguy.
Step one, as stated above, is to change the tomcat start script to add
When you have your slowdown, look for things in catalina.out like "FullGC" or many GCs...
I'd note, if you haven't already done so, bump up the tomcat heap size to somewhere around 1/2 to 3/4 available memory assuming this box JUST runs tomcat. For example, to set the max heap to 768 megabytes, you would add:
to
JAVA_OPTS
If you're using ubuntu 10.04, these settings would generally be located in /etc/default/tomcat6.
We've had this happen when a good chunk of the memory in the "tenured" generation of the Java heap has been swapped out to disk because it is garbage and hasn't been used in a while. When a full collection is needed, that memory must be swapped back in.
In this case, your answer is somewhat counter-intuitive: REDUCE the size of the Java heap, or figure out what other stuff is using RAM that is causing the swapping. In our case, some nightly batch jobs used a bunch of RAM, causing the old generation to get swapped out to disk. So, then, the first Full GC needed the next morning took FOREVER (180+ seconds, much as you are seeing).
You might also try the concurrent-mark sweep collector, which reduces Full GC times by doing much of the work in parallel. This is the best documentation I've seen; there are also some good Sun blogs on the subject: http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
Try activating verbose garbage collection and see if it's a garbage collection pause. I guess a huge heap, lots of object allocation and swap might cause a long pause, but that long sounds quite unusual.