I want Upstart to do two things:
- stop trying to respawn a failed process so fast
- never give up trying to respawn
In an ideal world, Upstart would try to restart a dead process after 1s, then double that delay on each attempt, until it reached an hour.
Is something like this possible?
The Upstart Cookbook recommends a post-stop delay (http://upstart.ubuntu.com/cookbook/#delay-respawn-of-a-job). Use the respawn stanza without arguments and it will continue trying forever (I got this from this Ask Ubuntu question).
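A minimal sketch of such a job file, assuming a hypothetical service called myservice installed at /usr/local/bin/myservice and a fixed 5-second delay (all three are placeholders of mine). I've wrapped it in a shell function so it can be printed and piped straight into /etc/init:

```shell
# Print a minimal Upstart job definition to stdout.
# The job name, description and binary path are placeholders.
print_job_conf() {
    cat <<'EOF'
description "myservice (placeholder)"

# Restart the main process whenever it exits unexpectedly.
respawn

# post-stop runs before the job is started again, so sleeping here
# spaces the respawn attempts at least 5 seconds apart.
post-stop exec sleep 5

exec /usr/local/bin/myservice
EOF
}
```

To install it you would run something like: print_job_conf | sudo tee /etc/init/myservice.conf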
To add the exponential delay part, I'd try working with an environment variable in the post-stop script. I think something like:
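Here is a rough sketch of the doubling logic as a small POSIX sh function. The names next_delay and SLEEP_TIME are my own, and the wiring shown in the comments assumes an Upstart version that provides initctl set-env and keeps the updated value for the next respawn, which I have not verified on every release:

```shell
# Return the next delay: double the current one, capped at a maximum.
# $1 = current delay in seconds, $2 = cap in seconds (default 3600).
# The body runs in a subshell so its variables do not leak.
next_delay() (
    cur=$1
    max=${2:-3600}
    next=$((cur * 2))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
)

# Intended use in the job file (sketch, with the function pasted at the
# top of the script section):
#
#   env SLEEP_TIME=1
#   post-stop script
#       sleep "$SLEEP_TIME"
#       initctl set-env SLEEP_TIME="$(next_delay "$SLEEP_TIME" 3600)"
#   end script
```

Starting from 1 second, this gives 1, 2, 4, 8, ... and then stays at the cap, which matches the "double until it reaches an hour" behaviour asked for.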
** EDIT **
To apply the delay only when respawning, avoiding the delay on a real stop, use the following, which checks whether the current goal is "stop" or not:
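A sketch of that goal check, assuming the job has no instances so that the second field of the initctl status line is "goal/state". The helper name goal_from_status is mine:

```shell
# Extract the goal from one line of `initctl status <job>` output, e.g.
#   "myservice stop/post-stop, process 1234"  ->  "stop"
# Assumes a job without instances (second field is "goal/state,").
goal_from_status() {
    printf '%s\n' "$1" | awk '{print $2}' | cut -d/ -f1
}

# Intended use in post-stop (sketch):
#
#   goal=$(goal_from_status "$(initctl status "$UPSTART_JOB")")
#   if [ "$goal" != "stop" ]; then
#       sleep "$SLEEP_TIME"
#   fi
```

Comparing against "stop" rather than "start" is deliberate: any goal other than "stop" means Upstart intends to bring the job back up, so the delay applies.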
As already mentioned, use respawn to trigger the respawn. However, the Upstart Cookbook coverage on respawn limit says that you'll need to specify respawn limit unlimited to have continual retry behaviour. By default it will retry as long as the process doesn't respawn more than 10 times in 5 seconds.
I would therefore suggest:
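For illustration, a job file along these lines might look as follows. The job name and binary path are placeholders, and it assumes your Upstart version's init(5) documents the unlimited form of respawn limit:

```shell
# Print a job definition that never gives up respawning (sketch).
# Job name and binary path are placeholders.
print_unlimited_job_conf() {
    cat <<'EOF'
description "myservice (placeholder)"

respawn
# Lift the default cap of 10 respawns in 5 seconds.
respawn limit unlimited

# Wait between attempts so a crashing process does not spin.
post-stop exec sleep 5

exec /usr/local/bin/myservice
EOF
}
```

As before, print_unlimited_job_conf | sudo tee /etc/init/myservice.conf would install it.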
I ended up putting a start command in a cron job. If the service is running, it has no effect. If it's not running, it starts the service.
I have made an improvement to Roger's answer. Typically you want to back off when a problem in the underlying software causes it to crash a lot in a short period of time, but once the system has recovered you want to reset the backoff time. In Roger's version, once there have been 7 crashes the service always sleeps for 60 seconds, even after a single, isolated crash.
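One way to sketch the reset is to remember when the previous crash happened and start over if the service stayed up for a while. The function name backoff_delay, the 300-second quiet window, the 60-second cap and the LAST_CRASH variable are all assumptions of mine, and the wiring again relies on initctl set-env being available:

```shell
# Decide how long to sleep before the next respawn.
# $1 = time of the previous crash (epoch seconds)
# $2 = current time (epoch seconds)
# $3 = previous delay in seconds
# $4 = quiet window that resets the backoff (default 300 s)
# $5 = maximum delay (default 60 s)
backoff_delay() (
    last=$1; now=$2; cur=$3
    window=${4:-300}; max=${5:-60}
    if [ $((now - last)) -gt "$window" ]; then
        # No crash for longer than the window: treat it as isolated.
        echo 1
    else
        next=$((cur * 2))
        if [ "$next" -gt "$max" ]; then
            next=$max
        fi
        echo "$next"
    fi
)

# Intended use in post-stop (sketch):
#
#   now=$(date +%s)
#   delay=$(backoff_delay "${LAST_CRASH:-0}" "$now" "${SLEEP_TIME:-1}")
#   initctl set-env LAST_CRASH="$now"
#   initctl set-env SLEEP_TIME="$delay"
#   sleep "$delay"
```

The window should be comfortably larger than the maximum delay, otherwise the sleep itself would count as "quiet time" and reset the backoff on every crash.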
You want respawn limit <times> <period>. Although this would not provide the exponential behavior you are looking for, it would probably do for most use cases. You might try using very large values for times and period to approximate what you are trying to achieve. See the respawn limit section of man 5 init for reference.
Others have answered the question for the respawn and respawn limit stanzas, but I would like to add my own solution for the post-stop script that controls the delay between restarts.
The biggest problem with the solution proposed by Roger Dueck is that the delay causes 'restart jobName' to hang until the sleep is completed.
My addition checks to see if there is a restart in progress before determining whether or not to sleep.
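A possible way to do that check, as a sketch only: look for a blocked restart command for this job in the process list. The helper name restart_requested is hypothetical, the match is a heuristic, and it assumes the job name contains no regex metacharacters:

```shell
# Succeed if any command line read from stdin looks like `restart <job>`.
# $1 = job name. Feed it e.g. the output of `ps -eo args`.
restart_requested() {
    grep -Eq "(^|[/ ])restart +$1( |\$)"
}

# Intended use in post-stop (sketch):
#
#   if ! ps -eo args | restart_requested "$UPSTART_JOB"; then
#       sleep "$SLEEP_TIME"
#   fi
```

Because restart jobName blocks until the job is back up, its process is still visible while post-stop runs, which is what this relies on.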