I find it fairly common for a Linux server to slow down to the point of complete unresponsiveness (load average 150+, etc.). When I look at it afterwards with sar or munin, there is a sudden, rapid increase in the number of processes. I generally need to reboot the machine at that point, but it always leaves me wondering what caused the problem in the first place.
I'm assuming a rogue process goes into some kind of loop, spawning loads of new processes, which then eat up the RAM and cause the lockup. But how, after the event, can I determine which application/process was the offender?
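For reference, this is roughly how I look at the history afterwards with sar (reading the day's sysstat file; the path and day-of-month suffix are examples, and vary by distro, e.g. /var/log/sysstat/ on Debian/Ubuntu vs /var/log/sa/ on RHEL):

```
# Load average, run queue and total process count over the day
sar -q -f /var/log/sysstat/sa15

# Process creation rate (proc/s) and context switches over the day
sar -w -f /var/log/sysstat/sa15
```

That tells me *when* the process count exploded, but not *which* process was responsible.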
Thanks
Install `atop` and configure it to save a snapshot every 60 seconds. Then, when your system goes nuts again, you can reboot and use

```
atop -r /var/log/atop.log
```
to go back in time and see what went wrong.
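A minimal sketch of the interval configuration, assuming a Debian/Ubuntu-style package (the file location and variable names vary by distribution and atop version, so check your own /etc/default/atop or /etc/sysconfig/atop):

```
# /etc/default/atop -- read by the atop service at startup
LOGOPTS=""           # extra options passed to the logging atop instance
LOGINTERVAL=60       # seconds between snapshots (the packaged default is 600)
LOGGENERATIONS=28    # number of daily log files to keep before rotation
```

Restart the service afterwards so the new interval takes effect. Once inside `atop -r`, you can step forward through the snapshots with `t` and backward with `T`, jump to a specific time with `b`, and sort the process list by memory with `M` or by CPU with `C`, which makes it easy to spot the process tree that blew up just before the lockup.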