I'm trying to start a program (Resque), but it takes a while before its pidfile is written. As a result, I think Monit assumes the program hasn't started and launches one or two more instances before the pidfile of the first one is written.
How do I delay the time Monit checks again, just for this process? Or should I solve this in another way?
You can check a specific service on a different interval than the default...
See SERVICE POLL TIME in the Monit documentation.
An example for your Resque program would be to check on a different number of cycles:
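A minimal sketch of such a check. The pidfile path and init script here are placeholders I made up, not taken from your setup:

```
# Hypothetical paths; adjust to wherever your Resque worker writes its pidfile.
check process resque with pidfile /var/run/resque.pid
  # Poll this service only on every 5th Monit cycle.
  every 5 cycles
  start program = "/etc/init.d/resque start"
  stop program  = "/etc/init.d/resque stop"
```

With `set daemon 60` in the global section, this service would be polled roughly every 5 minutes while everything else is still polled every minute.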
or from the examples section:
or you can leverage the cron-style checks.
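A sketch of the cron-style form, again with a placeholder pidfile path. The schedule string is only an illustration:

```
# Hypothetical pidfile path.
check process resque with pidfile /var/run/resque.pid
  # Cron fields: minute hour day-of-month month day-of-week.
  # Here: poll only between 08:00 and 19:59, Monday to Friday.
  every "* 8-19 * * 1-5"
```

There is also a `not every "..."` variant that skips the check during the given window instead.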
or if you're experiencing a slow startup, you can extend the timeout in the service start command:
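Something along these lines, assuming a hypothetical init script and that 90 seconds is enough for Resque to write its pidfile:

```
# Hypothetical paths; the 90-second value is an assumption, tune it to your startup time.
check process resque with pidfile /var/run/resque.pid
  # Give the start command up to 90 seconds before Monit treats it as failed.
  start program = "/etc/init.d/resque start" with timeout 90 seconds
  stop program  = "/etc/init.d/resque stop"
```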
How do I delay the time Monit checks again, just for this process?
What you are trying to achieve can be done via the "SERVICE POLL TIME" feature of Monit.
Monit documentation says
One of the methods to customize the service poll time is
EVERY [number] CYCLES
Example:
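A sketch showing how the per-service setting combines with the global poll interval. The 60-second cycle and the pidfile path are assumptions:

```
# Global poll cycle: Monit wakes up every 60 seconds.
set daemon 60

# Hypothetical pidfile path.
check process resque with pidfile /var/run/resque.pid
  # 3 cycles x 60 seconds: this service is checked about every 180 seconds.
  every 3 cycles
```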
Or should I solve this in another way?
I also made an initial attempt to monitor Resque jobs with Monit, because Monit is a very lightweight daemon, but eventually settled on God. I know, I know: God is more resource-hungry than Monit, but for Resque we found it to be a good match.
You can also check if something has failed for X times straight:
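For instance (the service name, paths, and port are made up for illustration):

```
# Hypothetical service; replace the name, paths, and port with your own.
check process myapp with pidfile /var/run/myapp.pid
  start program = "/etc/init.d/myapp start"
  stop program  = "/etc/init.d/myapp stop"
  # Act only after the port test has failed on 3 consecutive cycles.
  if failed port 8080 for 3 cycles then restart
```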
Or for X times within Y polls:
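A sketch with the same kind of placeholder service:

```
# Hypothetical service; replace the name, paths, and port with your own.
check process myapp with pidfile /var/run/myapp.pid
  start program = "/etc/init.d/myapp start"
  stop program  = "/etc/init.d/myapp stop"
  # Act if the port test failed in any 3 of the last 5 cycles.
  if failed port 8080 for 3 times within 5 cycles then restart
```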
Or both:
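One way to read "both" is to put the two kinds of rule in the same check, for example alerting on intermittent failures and restarting on a sustained one. The names, port, and thresholds are placeholders:

```
# Hypothetical service; replace the name, paths, and port with your own.
check process myapp with pidfile /var/run/myapp.pid
  start program = "/etc/init.d/myapp start"
  stop program  = "/etc/init.d/myapp stop"
  # Flaky: failed in 3 of the last 5 cycles, so just send an alert.
  if failed port 8080 for 3 times within 5 cycles then alert
  # Really down: failed 10 cycles in a row, so restart.
  if failed port 8080 for 10 cycles then restart
```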
(from here)
A member of my team came up with a rather clever solution that allows monit to check frequently (every minute), but once it has attempted to restart the service (which takes ~10 minutes) it will wait a specified grace period before attempting to start again.
This avoids waiting too long between checks, which, combined with a slow start, has a much larger impact on customers. It works by using an intermediate script that acts as a flag to indicate that Monit is already taking action on the last failure.
If Bamboo (a slow-starting web app) is down for 3 minutes in a row, restart it, but only if a restart script is not already running.
The script that is called has a specified sleep that waits LONGER than the slowest start time for the service (in our case we expect startup to finish in ~10 minutes, so we sleep for 15).
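A rough sketch of how the rule and the guard could be wired together. This is not our exact configuration: it swaps in `flock -n` from util-linux as the "already running" flag, and the paths, port, and script name are all hypothetical:

```
# Assumes "set daemon 60", so one cycle is one minute.
# Hypothetical pidfile path.
check process bamboo with pidfile /opt/bamboo/bamboo.pid
  # Down for 3 cycles in a row: run the restart script under a lock.
  # flock -n exits immediately if a previous run still holds the lock,
  # so no second restart is started during the grace period.
  # restart_bamboo.sh (hypothetical) restarts the service and then sleeps
  # for 15 minutes, which keeps the lock held for the whole grace period.
  if failed port 8085 for 3 cycles
    then exec "/usr/bin/flock -n /var/lock/bamboo-restart.lock /opt/scripts/restart_bamboo.sh"
```

Depending on your Monit version, an `exec` action may run only once on a state change or on every failing cycle, so check how often the script would actually be invoked.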
The current version of Monit (5.16) supports a timeout for the start scripts with the syntax:
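As I understand it, the option is appended to the start (or stop) line. A sketch with placeholder names and timeout values:

```
# Hypothetical service and paths; the timeout values are only examples.
check process myapp with pidfile /var/run/myapp.pid
  # Wait up to 60 seconds for the start script before reporting a failure.
  start program = "/etc/init.d/myapp start" with timeout 60 seconds
  # The same option can be given for the stop script.
  stop program  = "/etc/init.d/myapp stop" with timeout 30 seconds
```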
The docs state:
Which is what the "timeout" value will do.