My team and I have been struggling to keep a clustered ColdFusion application stable for the better part of the last six months, with little result. We are turning to SF in the hope of finding some JRun experts or fresh ideas, because we can't seem to figure it out.
The setup:
Two ColdFusion 7.0.2 instances clustered with JRun 4 (w/ the latest update) on IIS 6 under Windows Server 2003. Two quad-core CPUs, 8GB RAM.
The issue:
Every now and again, usually about once a week, one of the instances will stop handling requests completely. There is no activity on it whatsoever and we have to restart it.
What we know:
Every time this happens, JRun's error log is full of java.lang.OutOfMemoryError: unable to create new native thread.
After reading the JRun documentation from Macromedia/Adobe and many confusing blog posts, we've more or less narrowed it down to incorrect/unoptimized JRun thread pool settings in the instance's jrun.xml.
Relevant part of our jrun.xml:
<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
  <attribute name="activeHandlerThreads">500</attribute>
  <attribute name="backlog">500</attribute>
  <attribute name="deactivated">false</attribute>
  <attribute name="interface">*</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="minHandlerThreads">1</attribute>
  <attribute name="port">51003</attribute>
  <attribute name="threadWaitTimeout">300</attribute>
  <attribute name="timeout">300</attribute>
  {snip}
</service>
I enabled JRun's metrics logging last week to collect data related to threads (the logging configuration I used is sketched below, after the numbers). This is a summary of the data after letting it log for a week.
Average values:
{jrpp.listenTh} 1
{jrpp.idleTh} 9
{jrpp.delayTh} 0
{jrpp.busyTh} 0
{jrpp.totalTh} 10
{jrpp.delayRq} 0
{jrpp.droppedRq} 0
{jrpp.handledRq} 4
{jrpp.handledMs} 6036
{jrpp.delayMs} 0
{freeMemory} 48667
{totalMemory} 403598
{sessions} 737
{sessionsInMem} 737
Maximum values:
{jrpp.listenTh} 10
{jrpp.idleTh} 94
{jrpp.delayTh} 1
{jrpp.busyTh} 39
{jrpp.totalTh} 100
{jrpp.delayRq} 0
{jrpp.droppedRq} 0
{jrpp.handledRq} 87
{jrpp.handledMs} 508845
{jrpp.delayMs} 0
{freeMemory} 169313
{totalMemory} 578432
{sessions} 2297
{sessionsInMem} 2297
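For reference, this is roughly how we switched metrics logging on in the instance's jrun.xml LoggerService. The attribute names are reproduced from memory/Adobe's technotes rather than copied from our file, so double-check them against your own install:

<service class="jrunx.logger.LoggerService" name="LoggerService">
  <!-- log a metrics line at a fixed interval (seconds) using the tokens shown above -->
  <attribute name="metricsEnabled">true</attribute>
  <attribute name="metricsLogFrequency">60</attribute>
  <attribute name="metricsFormat">Web threads (busy/listen/idle/total): {jrpp.busyTh}/{jrpp.listenTh}/{jrpp.idleTh}/{jrpp.totalTh} Sessions: {sessions} Total Memory={totalMemory} Free={freeMemory}</attribute>
  {snip}
</service>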
Any ideas as to what we could try now?
Cheers!
EDIT #1 -> Things I forgot to mention: Windows Server 2003 Enterprise w/ JVM 1.4.2 (for JRun)
The max heap size cap is around 1.4GB, yeah. We used to have leaks, but we fixed them; now the application uses around 400MB, rarely more. The max heap size is set to 1200MB, so we aren't reaching it. When we did have leaks the JVM would just blow up and the instance would restart itself. That isn't happening now; it simply stops handling incoming requests.
We were thinking it has to do with threads, following this blog post: http://www.talkingtree.com/blog/index.cfm/2005/3/11/NewNativeThread
The Java exception being thrown is of type OutOfMemoryError, but it's not actually saying that we ran out of heap space, just that the JVM couldn't create new threads. The exception type is a bit misleading.
Basically, the blog is saying that 500 as activeHandlerThreads might be too high, but my metrics seem to show that we get nowhere near that, which is confusing us.
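For anyone who hasn't seen this flavour of OutOfMemoryError before, a throwaway test like the sketch below (illustrative only, not from our application, written against 1.4-era APIs) reproduces it: the JVM dies while creating threads even though the heap is still nearly empty.

// Sketch: exhaust native thread resources while barely touching the heap.
public class ThreadExhaustion {
    public static void main(String[] args) {
        int count = 0;
        try {
            while (true) {
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // Sleep forever so the thread (and its native stack) stays alive.
                            Thread.sleep(Long.MAX_VALUE);
                        } catch (InterruptedException e) {
                            // Ignored: we only care about keeping the thread around.
                        }
                    }
                });
                t.setDaemon(true);
                t.start();
                count++;
            }
        } catch (OutOfMemoryError e) {
            // The heap is mostly free when this fires; the process ran out of
            // native memory/address space for another thread stack.
            Runtime rt = Runtime.getRuntime();
            long usedHeapMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("Failed after " + count + " threads; used heap ~" + usedHeapMb + " MB");
            System.out.println("Error: " + e);
        }
    }
}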
Well, let's look at some bigger picture issues before getting into JRun configuration details.
If you're getting java.lang.OutOfMemoryError exceptions in the JRun error log, well, you're out of memory. No upvote for that, please ;-). You didn't say whether you were running 32- or 64-bit Windows, but you did say that you have 8 GB of RAM, so that will have some impact on an answer. Whether or not you're running a 32- or 64-bit JVM (and what version) will also impact things. So those are a few answers that will help us get to the bottom of this.
Regardless, your application IS running out of memory. It's running out of memory for one or more of these reasons: the heap is genuinely exhausted (a leak, or a workload that needs more heap than you've allocated), or the process has run out of the native memory/address space it needs outside the heap, for example to create new threads.
Other things to keep in mind: on 32-bit Windows, a 32-bit JVM can only allocate a heap of approximately 1.4 GB. I don't recall off the top of my head whether a 32-bit JVM on 64-bit Windows is limited to less than the theoretical 4 GB max for any 32-bit process.
UPDATED
I read the blog post linked via TalkingTree and the other post linked within that post as well. I haven't run into this exact case, but I did have the following observation: the JRUN metrics logging may not record the "max values" you cited in a period of peak thread usage. I think it logs metrics at a fixed, recurring interval. That's good for showing you smooth, average performance characteristics of your application, but it may not capture JRUN's state right before your error condition begins to occur.
Without knowing the internal workings of JRun's thread management, I still say that it really is out of memory. Perhaps it's not out of memory because your app needed to allocate memory on the JVM heap and none was available; rather, it's out of memory because JRun tried to create another thread to handle an incoming request and the native memory needed to back that thread's stack wasn't available. In other words, threads aren't free: they require memory outside the heap as well.
Your options seem to be as follows: move to an OS/JVM combination that gives the process more address space (64-bit), shrink the maximum heap so more address space is left for thread stacks, shrink the per-thread stack size, or cap JRun's handler thread pools lower so it never tries to create that many threads.
Regardless of the option you pursue, I think a valid fix here is going to be experimental in nature. You're going to have to make a change and see what effect it has on the application. You have a load testing environment, right?
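For example, if you try the thread-pool route, the change is just the two cap attributes in the ProxyService block you already posted. The values below are placeholders to experiment with under load, not a recommendation:

<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
  <!-- experiment: lower caps so JRun never tries to create more threads than the address space can back -->
  <attribute name="activeHandlerThreads">100</attribute>
  <attribute name="maxHandlerThreads">200</attribute>
  {snip}
</service>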
Try reducing the maximum heap size. Each thread requires native resources (along with Java's own bookkeeping). The usable virtual address space (AS) is 2GB; 1.2GB of it is reserved for the heap. Part of the remaining 800MB is used for code (the text segments of java.exe and all required DLLs), then there are native allocations required by the JRE and its dependencies... and then the threads: for each thread, by default 1MB of AS is reserved (though only a page is actually committed up front), so 100 threads = 100MB just for the stacks. Now add a bit of extra space between the various pieces, some fragmentation... OOM ;-)
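If you go that route, the knobs live in JRun's jvm.config (the java.args line). The sketch below shows the kind of values to experiment with, assuming a standard JRun 4 layout; the numbers are starting points rather than recommendations, and shrinking -Xss too far will trade the OOM for StackOverflowErrors on deep call stacks, so change one thing at a time under load.

# {jrun_root}/bin/jvm.config -- sketch only, tune under load testing
# A smaller heap leaves more of the 2GB address space for thread stacks,
# and a smaller per-thread stack stretches that remainder further.
java.args=-Xmx896m -Xss256k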