I am trying to use monit to find surefire processes running too long and kill them.
The machine runs parallel builds, so several surefire processes may be running at the same time, and there is no PID file for those processes.
My monit config looks like this:
check process surefire matching "surefire/surefirebooter"
if uptime > 4 hours then alert
if uptime > 4 hours then stop
The alert is sent, but the stop does not work.
I can't use killall, since the process is run by java and there are several other java processes running.
All I need is to detect the right PID of that process so I can kill the right one.
There is a MONIT_PROCESS_PID environment variable, which monit passes into the environment of any program run by the exec action.
if uptime > 4 hours then stop
should be replaced by
if uptime > 4 hours then exec "/usr/bin/monit-kill-process.sh"
and /usr/bin/monit-kill-process.sh should look like
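A minimal sketch of such a script, assuming only that monit exports MONIT_PROCESS_PID to the exec'd program. The function name kill_monitored, the GRACE_SECONDS variable, and the TERM-then-KILL escalation are illustrative choices, not something monit requires:

```shell
#!/bin/sh
# Hypothetical /usr/bin/monit-kill-process.sh
# Assumption: monit sets MONIT_PROCESS_PID for programs started via "exec".

# Send SIGTERM to the given PID, wait up to GRACE_SECONDS (default 10),
# then send SIGKILL if the process is still there.
kill_monitored() {
    pid="$1"
    # Refuse to act unless the argument is a plain number.
    case "$pid" in
        ''|*[!0-9]*) echo "no valid PID supplied" >&2; return 1 ;;
    esac
    # If SIGTERM cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "${GRACE_SECONDS:-10}" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null || true
    return 0
}

# Only act when monit actually supplied a PID.
if [ -n "${MONIT_PROCESS_PID:-}" ]; then
    kill_monitored "$MONIT_PROCESS_PID"
fi
```

The script needs to be executable by the user monit runs as.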
The only problem is that monit is not the right tool for this job anyway: it expects a process matching the check pattern to be found every time it performs the check, and otherwise it tries to start the process using the start part of the check definition (which is not what we want).
So I found and modified this ps/grep/perl/xargs one-liner, which I run through cron. It can find processes by a substring of their command line, select the long-running ones, and deal with them.
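A rough sketch of the same idea, using awk in place of perl. It assumes the procps-ng ps found on Linux (for the etimes column, elapsed time in seconds) and an xargs that supports -r; the function name select_old_pids and the argument handling are made up for illustration:

```shell
#!/bin/sh
# Read "PID ELAPSED_SECONDS COMMAND..." lines on stdin and print the PIDs
# whose command line contains the substring $1 and whose elapsed time
# exceeds $2 seconds.
select_old_pids() {
    awk -v pat="$1" -v limit="$2" '
        {
            pid = $1; age = $2
            $1 = ""; $2 = ""    # leave only the command line in $0
            if (age + 0 > limit + 0 && index($0, pat) > 0) print pid
        }'
}

# When called with a substring and an age limit, kill the matching processes.
# The awk process itself also carries the substring in its arguments, but it
# is only a moment old, so the age filter skips it.
if [ "$#" -eq 2 ]; then
    ps -eo pid=,etimes=,args= | select_old_pids "$1" "$2" | xargs -r kill
fi
```

Saved under some path of your choosing and run from cron with the arguments "surefire/surefirebooter" and 14400, it would send SIGTERM to every matching process older than 4 hours.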
Monit may not be the right tool for this. Its pattern matching only uses the first match, so with several surefire processes running in parallel only one of them would be checked.
This can be tested with
monit procmatch <pattern>
I'd suggest tagging your builds with a unique identifier and using that in the pattern matching sequence... Or managing the daemon entirely with monit.
You don't need to use killall either. Maybe some logic around pkill or pgrep would do; both can match against the full command line with -f.
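For instance, something along these lines, assuming procps-ng pgrep and ps on Linux (the function name long_running_pids is just for illustration):

```shell
#!/bin/sh
# Print the PIDs whose full command line matches the regex $1 (pgrep -f)
# and which have been running for more than $2 seconds.
long_running_pids() {
    for pid in $(pgrep -f -- "$1"); do
        # etimes is the elapsed time in seconds; empty if the PID has vanished.
        age=$(ps -o etimes= -p "$pid" 2>/dev/null | tr -d ' ')
        if [ -n "$age" ] && [ "$age" -gt "$2" ]; then
            echo "$pid"
        fi
    done
}
```

The output of long_running_pids "surefire/surefirebooter" 14400 can then be passed to kill. Note that pgrep -f also matches any shell whose own command line contains the pattern, so it is worth checking what it matches before killing anything.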
Also see: monit: check process without pidfile