I am running a number of Java processes on a single Linux machine. From a memory and CPU standpoint, everything is fine when things are static.
However, periodically we use a configuration management package to upgrade the jar or war files and restart the Java processes.
The problem is that it restarts them all in quick succession, so we get 10 or so Java VMs starting at the same time (we use daemontools for the service stops/starts). This wreaks havoc on the machine: we get OOMs, or it just becomes really slow, because the JVM is being spawned 10x at once.
Other than trying to stagger the startups, is there a smarter way of handling this? Maybe some sysctl performance tuning, or a JVM parameter?
I think the concept that cfengine is using is great for you. I wouldn't bother with sysctl tuning - it might hurt the actual runtime performance, after all! - but go with the same approach that cfengine has.
So what does it do? cfengine has splay time - if you have 1000 client nodes configured to connect to the master server and ask if there's something new to do, each client node will connect at the configured time + its splay time. So, you can configure your nodes to connect to your server at 00:00, 01:00, 02:00 ... but some of them will actually connect at 00:00:00, some at 00:00:30, some at 00:01:00 ... depending on whatever you configure the splay range to be.
I'm not suggesting you actually use cfengine; I only mention it because of the splay time. Go ahead and imitate its behaviour. :-)
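A minimal sketch of imitating that, assuming each service has a unique name: derive a stable delay from a checksum of the name, so every service always gets the same offset inside the splay window. The function name splay_delay and the 120-second window are made up for illustration, and this is not a copy of how cfengine computes it.

```shell
#!/bin/sh
# Hypothetical helper: map a service name to a stable delay in [0, max) seconds.
splay_delay() {
    name=$1
    max=$2
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    echo $(( crc % max ))
}

# Example use at the top of a start script for a service called "app1":
#   sleep "$(splay_delay app1 120)"
```

Because the delay depends only on the name, the start order is the same on every rollout, which makes it easier to reason about than a purely random sleep.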
If I were you, I would not try to start up 10 Java VMs at the same time; that sounds like some serious server torturing.
Is there some script you could call instead that starts the Java processes one at a time, using either a timer or resource-usage monitoring to decide when to start the next one? That is the next thing I would try.
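Something along these lines might work as a starting point. It is only a sketch: it assumes the services are down and live under daemontools service directories, that the 1-minute load average from /proc/loadavg is a good enough "has it settled?" signal, and the names MAX_LOAD, SETTLE, LOADAVG_FILE, load_below and start_staggered are all invented here.

```shell
#!/bin/sh
# Hypothetical sketch: bring daemontools services up one by one.
MAX_LOAD=${MAX_LOAD:-2}    # assumed threshold for the 1-minute load average
SETTLE=${SETTLE:-10}       # assumed minimum pause (seconds) after each start

# Succeeds when the integer part of the 1-minute load average is below $1.
load_below() {
    read -r one rest < "${LOADAVG_FILE:-/proc/loadavg}"
    [ "${one%%.*}" -lt "$1" ]
}

# Starts each service directory given as an argument, waiting in between.
start_staggered() {
    for dir in "$@"; do
        svc -u "$dir"              # daemontools: bring the service up
        sleep "$SETTLE"
        until load_below "$MAX_LOAD"; do
            sleep 5
        done
    done
}

# Example (service directories are placeholders):
#   start_staggered /service/app1 /service/app2
```

The thresholds would need tuning for the machine; a fixed sleep alone is simpler if the JVMs take a predictable time to warm up.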
The other solution would be to throw more memory at it. I know it is fine once things have settled down, but if it's thrashing that much when restarting the processes (and when the server boots), throwing hardware at it might be the most time-economical way to solve the problem. I vaguely recall from the podcasts that that was Atwood's philosophy when bootstrapping the StackOverflow/ServerFault/Superuser sites: hardware was cheap, so if resources get tight, add more. I could be misremembering it though. Just a thought.
One other approach is to edit the run files for the different Java daemons and put a random sleep before the daemon is exec'd. That might not be what you want when manually restarting in other circumstances, but it can help keep the load down when all of them are starting in a thundering herd.
sleep $(( RANDOM % 300 ))
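To deal with the manual-restart case, one option is a flag file that switches the sleep off. This is just a sketch: MAX_DELAY, choose_delay and the ./no-splay file are made-up names, bash is assumed for $RANDOM, and it relies on daemontools running the run file from inside the service directory.

```shell
#!/bin/bash
# Hypothetical run-file fragment; MAX_DELAY and ./no-splay are invented names.
MAX_DELAY=${MAX_DELAY:-300}

# Sets $delay to 0 if ./no-splay exists, else to a random value in [0, MAX_DELAY).
choose_delay() {
    if [ -e ./no-splay ]; then
        delay=0
    else
        delay=$(( RANDOM % MAX_DELAY ))
    fi
}

# In the run file, after the function definition:
#   choose_delay
#   sleep "$delay"
#   exec java -jar /path/to/app.jar   # placeholder command line
```

Touch no-splay in the service directory before a manual restart and remove it afterwards, and the herd protection stays in place for the automated upgrades.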