We have a CF8 app that runs for 20-25 minutes before crashing under heavy load (~1200 users). The load is generated by our load testing tool: 1200 users ramped up over 5 minutes (approximating the behavior of our real users) and then sustained for an hour.
We have this app on Solaris 10, Apache 2, JRun 4 and Oracle 10g. Java version is 1.6.
During the initial load tests, the thread dumps showed monitor deadlocks involving sessions:
"jrpp-173":
waiting to lock monitor 0x019fdc60 (object 0x6b893530, a java.util.Hashtable),
which is held by "scheduler-1"
"scheduler-1":
waiting to lock monitor 0x026c3ce0 (object 0x6abe2f20, a coldfusion.monitor.memory.SessionMemoryMonitor$TopMemoryUsedSessions),
which is held by "jrpp-167"
"jrpp-167":
waiting to lock monitor 0x019fdc60 (object 0x6b893530, a java.util.Hashtable),
which is held by "scheduler-1"
We increased the number of simultaneous request threads relative to the number of CPUs (48 simultaneous threads against 32 CPUs), and the deadlock went away. While varying the simultaneous threads helped response times a little, the CF server still tanked in 20-25 minutes during all of these tests.
We ran more thread dumps and saw threads locking a monitor, e.g.:
"jrpp-475" prio=3 tid=0x02230800 nid=0x2c5 runnable [0x4397d000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.getEntry(HashMap.java:347)
at java.util.HashMap.containsKey(HashMap.java:335)
at java.util.HashSet.contains(HashSet.java:184)
at coldfusion.monitor.memory.MemoryTracker.onAddObject(MemoryTracker.java:124)
at coldfusion.monitor.memory.MemoryTrackerProxy.onReplaceValue(MemoryTrackerProxy.java:598)
at coldfusion.monitor.memory.MemoryTrackerProxy.onPut(MemoryTrackerProxy.java:510)
at coldfusion.util.CaseInsensitiveMap.put(CaseInsensitiveMap.java:250)
at coldfusion.util.FastHashtable.put(FastHashtable.java:43)
- locked <0x6f7e1a78> (a coldfusion.runtime.Struct)
at coldfusion.runtime.CfJspPage._arrayset(CfJspPage.java:1027)
at coldfusion.runtime.CfJspPage._arraySetAt(CfJspPage.java:2117)
at cfvalidation2ecfc1052964961$funcSETUSERAUDITDATA.runFunction(/app/docs/apply/cfcs/validation.cfc:377)
As you can see in the last line above, there were several references to CFMs and CFCs, and those lines contain "cflock" tags scoped to "application". We (the dev team) then changed them to named locks.
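For illustration, a minimal sketch of that change (the variable and lock names here are hypothetical placeholders, not the actual code from validation.cfc):

<!--- Before: exclusive lock on the whole application scope; every request serializes here --->
<cflock scope="application" type="exclusive" timeout="30">
    <cfset application.userAuditData = arguments.auditData>
</cflock>

<!--- After: a named lock only serializes callers that use the same lock name --->
<cflock name="setUserAuditData" type="exclusive" timeout="30">
    <cfset application.userAuditData = arguments.auditData>
</cflock>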
After more load tests, there is no lock contention and there are no deadlocks, but now the application tanks in 7-10 minutes.
We've gotten system, network, and DB reports from the respective admins, and none of those resources are being taxed; we've also watched the server stats with the server monitor, top, prstat, sar reports, etc. So we believe it is an issue with the CF server or maybe the JVM.
I am running out of ideas as to what else we can try. Disclaimer: I am not a CF developer or admin. I am just running the load tests, analyzing the reports and thread dumps, sharing the results with the dev and admin teams, trying the next change, and so on. So far no dice.
Has anyone run into something similar? How did you go about diagnosing and troubleshooting? All thoughts and pointers welcome.
Thank you for your time!
KM
When you lock the application scope (with either a named lock or a "scope" lock) you are essentially serializing access to the application variables, so that in itself is a problem. App vars should be locked and written once, then read at will (preferably, and safely, without locks, since they are not changing). It might be a red herring anyway: when you lock the app scope and hit those locks all the time, you naturally end up with lock-related activity in the logs (since so much is happening there).
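As a rough sketch of that write-once / read-freely pattern (application.config, its keys, and the lock name are made-up placeholders):

<cfif not structKeyExists(application, "config")>
    <cflock name="initAppConfig" type="exclusive" timeout="10">
        <!--- Re-check inside the lock so only the first thread writes --->
        <cfif not structKeyExists(application, "config")>
            <cfset application.config = structNew()>
            <cfset application.config.dsn = "myDSN">
            <cfset application.config.cacheMinutes = 30>
        </cfif>
    </cflock>
</cfif>

<!--- Later reads need no lock because the struct is written once and never changed --->
<cfset variables.dsn = application.config.dsn>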
Here are my best guesses:
Check the "coldfusion monitor" and make sure (double, triple, quadrupal) that the "memory profiling" is not enabled. In fact, for your load test make sure that tracking and monitoring is not enabled - and I would even turn off metrics. Your snippet above makes me think that memory tracking is on and you are simply overrunning the server. It happens because the heap has to be constantly introspected for changes - extremely expensive. Memory tracking should only be enabled for a short time when you are baselining.
32 CPUs and 48 simultaneous requests... that's a lot of CPU power. If you are using JRun, make sure the JRun handler threads are set far enough above that number that JRun has some control threads left to work with.
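For reference, the request-handler thread settings live in the JRunProxyService entry in jrun.xml; a hedged sketch only (confirm the exact attribute names against your install and pick values appropriate for your hardware):

<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
    <!-- request threads actively serviced at once -->
    <attribute name="activeHandlerThreads">64</attribute>
    <!-- upper bound on handler threads JRun may create -->
    <attribute name="maxHandlerThreads">1000</attribute>
    <attribute name="minHandlerThreads">1</attribute>
</service>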
Check the bin directory for HotSpot errors - files named like "hs_err_pidXXX.log" (or something similar). If you see them, the stack might give you some clues - but typically it's stuff like out-of-memory errors, or third-party CFX or native Java code throwing unhandled exceptions at the HotSpot compiler.
Finally, double-check your client vars. Sometimes you are using them when you don't "think" you are using them. If you are, they will be stored in a file on your Solaris server, and that will definitely cause a problem under load - especially as the file is appended to or locked during a purge. Check out this post for a bit more info (and see the sketch after the link):
http://www.coldfusionmuse.com/index.cfm/2009/10/27/registry.bad.datasource.good
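If client variables do turn out to be on, here is a minimal Application.cfc sketch for turning them off - or moving them into a database instead of the registry/file store ("myApp" and "myClientVarDSN" are placeholder names):

<cfcomponent>
    <cfset this.name = "myApp">

    <!--- Either switch client variables off entirely... --->
    <cfset this.clientManagement = false>

    <!--- ...or, if the app really needs them, store them in a datasource
          rather than the default registry/file store:
    <cfset this.clientManagement = true>
    <cfset this.clientStorage = "myClientVarDSN">
    --->
</cfcomponent>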
That's all that comes to mind at the moment... if you want me to look at it more closely on Monday (on a formal basis), let me know and we can chat about it.
Thank you, Mark and Adam, for taking the time to look over my question and for your suggestions. This particular issue has been resolved. Evidently, our new server was built off of an old server with "corrupt CF" -- whatever that means. It has since been rebuilt from scratch, and performance is no longer an issue.
KM