We are not yet experiencing any application errors, but our monitoring tools indicate that our application is running at the limits of its resources. Should we first add more heap or add an additional VM?
We have an application running on WebLogic/JRockit in a managed cluster.
We have AppDynamics monitoring this application, and it shows that major garbage collections are happening frequently (every 1-2 minutes on average!). When a major garbage collection runs it does recover space, and the lower bound of heap usage stays reasonably low even after the system has been up for weeks or months. Additionally, we ran the AppDynamics collections leak detection against production and it found no leaks. (We couldn't run the custom memory monitoring because it isn't supported with JRockit.) Overall it seems clear that there are no major leaks; the system simply requires more resources than it currently has.
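If you want per-collection detail beyond what AppDynamics surfaces, JRockit can log its own GC activity. A sketch of the flags, to be appended to each managed server's start options (the log path is a placeholder, and flag spellings should be checked against your JRockit version's documentation):

```shell
# Append to the managed server's start options (path is hypothetical):
# -Xverbose:memory  logs each collection (heap before/after, pause times)
# -Xverboselog:...  redirects the verbose output to a file instead of stdout
JAVA_OPTIONS="${JAVA_OPTIONS} -Xverbose:memory -Xverboselog:/var/log/app/gc.log"
```

With a few days of these logs you can measure GC frequency and pause times directly rather than inferring them from the monitoring dashboards.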
We have two non-production environments (dev and test) also running this application with reduced resources and reduced load. The test environment has two-thirds the number of VMs and half the heap per VM. We ran some load tests against it, but the results were not very helpful. While we can recreate the number of users with automated scripts, the data in our test environment is very different: queries return orders of magnitude less data, and so on. (Building a more realistic load-testing environment is certainly on the to-do list, but unlikely to actually happen any time soon for reasons of bureaucracy.) Even with everything we could throw at it, the test environment did not break a sweat.
We see two options. A) Add more heap. This seems like it would help for sure, but getting it done will require lots of paperwork (it would mean adding memory to the physical servers, which means server restarts involving lots of other applications, etc.). Also, I have no idea how much more memory to add, and we cannot just "test in prod". B) Add another VM (or two) for this application. This would be fairly easy; we have space on another physical server, so we could get it done quickly. But I am not sure it would help much, and if it doesn't, going back to option A later would be even harder.
Specific questions: 1) Is either of the above options obviously better (and why)? 2) If neither is obviously better, what tests would I run to decide which is? 3) How should I decide, and justify, how many more resources to add (heap or VMs)? (Bonus points if it involves the tools we already have available.)
Updates:
- 3 JVMs in a cluster, each JVM is on a separate VM.
- They are behind an Apache load balancer, each server gets roughly equal load.
- Each JVM has 1 GB heap.
- No FMW.
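For question 3, a rough first-order check is possible with numbers the monitoring already reports: the live set left after a major GC and the interval between major GCs give an approximate heap fill rate, from which you can predict how a bigger heap would stretch the GC interval. The figures below are hypothetical placeholders (only the 1 GB heap comes from the update above), and the model assumes load and live set stay constant:

```python
# Back-of-the-envelope GC-interval model. All inputs except heap_mb are
# hypothetical placeholders -- substitute figures from your own monitoring.

heap_mb = 1024            # current heap per JVM (from the update above)
live_after_gc_mb = 300    # heap usage right after a major GC (placeholder)
gc_interval_s = 90        # observed time between major GCs (placeholder)

# Fill rate: how fast the heap refills between major collections.
fill_rate_mb_per_s = (heap_mb - live_after_gc_mb) / gc_interval_s

def interval_for_heap(new_heap_mb):
    """Predicted major-GC interval if only the heap size changes."""
    return (new_heap_mb - live_after_gc_mb) / fill_rate_mb_per_s

for candidate in (1024, 1536, 2048):
    print(f"{candidate} MB heap -> major GC roughly every "
          f"{interval_for_heap(candidate):.0f} s")
```

To compare option B with the same model: adding VMs behind the load balancer divides the per-JVM fill rate by the new node count (to first order), so you can plug that in and see which change stretches the GC interval more per unit of effort.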
Assuming the application has been thoroughly profiled and no memory leaks exist (as seems to be the case), you have to work from the premise that the objects being created in the heap are due to the normal activity of the application.
Setting aside code optimisations, and/or finer tuning of the heap based on the size and lifecycle of the objects being created (which in turn depends on the specific JVM you use), there's not much room for improvement other than adding more managed nodes to your domain.
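On the "finer tuning" point: since this is JRockit, the main knobs are the heap bounds, the nursery size for short-lived objects, and the collector strategy. A hypothetical starting point (the sizes below are illustrative, not recommendations; measure before and after any change):

```shell
# Hypothetical JRockit tuning -- values are placeholders, not recommendations:
# -Xms/-Xmx equal and fixed avoids heap-resize pauses
# -Xns sizes the nursery where short-lived objects die cheaply
# -XgcPrio selects the collector goal (throughput vs. pausetime)
JAVA_OPTIONS="${JAVA_OPTIONS} -Xms1536m -Xmx1536m -Xns256m -XgcPrio:throughput"
```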
This can easily be done with a tool already present in every WebLogic installation: WLST. Adding managed servers and their respective node managers to an existing cluster with WLST is well documented.
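As a sketch, the online-mode WLST steps for adding a managed server to an existing cluster look roughly like this (credentials, host names, ports, and the cluster/machine names are placeholders; this runs inside the WLST shell, not a plain Python interpreter):

```
connect('weblogic', 'password', 't3://adminhost:7001')
edit()
startEdit()
# Create the new managed server and join it to the existing cluster
srv = cmo.createServer('ManagedServer_4')
srv.setListenAddress('newvmhost')
srv.setListenPort(7003)
srv.setCluster(getMBean('/Clusters/MyCluster'))
srv.setMachine(getMBean('/Machines/NewMachine'))
save()
activate(block='true')
```

After activation, the new server can be started through its node manager and the Apache load balancer configuration updated to include it.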
We ended up doing both (increasing the heap from 1 GB to 1.5 GB per JVM and expanding the cluster from 3 managed nodes to 5).
The heap was increased about an hour before the new nodes were added and was, by itself, enough to significantly reduce the number of garbage collections and the time spent in garbage collection.
Adding more nodes brought only a minor improvement, but it's difficult to say whether it really wasn't very helpful or whether there just wasn't much room left for improvement after increasing the heap.