I have a server which is monitored by Watchdog and which occasionally reboots due to faulty network hardware that I am unable to replace at the moment. As I understand it, Watchdog sends a SIGTERM to all processes, requesting a safe shutdown, and after a short time sends a SIGKILL, which stops them immediately. In my case this leads to data corruption, because the most important process has not finished shutting down and still has unwritten data.
How long does Watchdog pause between asking all processes to stop and forcing them to stop? Is the pause hardwired within Watchdog, set in watchdog.conf (if so, it was never documented in the manpage), or the same as another system setting? How can I change this setting?
Edit: I've found the timeout, but I am still looking for instructions on how to rebuild and integrate with the system properly.
From the Watchdog source, shutdown.c, line 445, the pause is hardwired into Watchdog, and is five seconds.

I have posted some information about building, configuring and testing the Linux watchdog daemon here:
http://www.sat.dundee.ac.uk/~psc/watchdog/Linux-Watchdog.html
The short answer is that you first need to set up your system to build this project:
Then get the source code, which you probably already have, but the latest version can be fetched with these steps:
Move to the code directory and prepare for compiling:
Then move to the source directory and compile it:
In the current directory you will have the new binaries. Test them before you make them "live" by using sudo make install, or at least make back-up copies of the system-supplied programs. They are bloated compared to the system ones due to debug symbols; you can use the strip command if you want to reduce their size.

Can you say what sort of time you need for SIGTERM to work?
Edited to add:
If you are using the current Git pull (14 Sep 2013), edit shutdown.c and at line 363 change "safe_sleep(4);" to use your desired timeout value in seconds. If editing the code for the system-supplied watchdog (as referenced above), take care not to sleep() for more than the hardware timeout (normally 60 seconds), as the system will simply reboot! That was the reason for the safe_sleep() function: to keep the watchdog fed while waiting.
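To illustrate why a plain sleep() is risky here, the following is a minimal sketch of the general idea of sleeping in short slices while feeding the watchdog. It is not the actual safe_sleep() code from shutdown.c; the names sleep_and_feed, feed_fn and count_feed are hypothetical, and what "feeding" means (for example, writing to the watchdog device) is left to the callback.

```c
#include <unistd.h>

/* Hypothetical callback type: performs one "feed" of the watchdog.
   In a real daemon this might write to the watchdog device; here it
   is left abstract on purpose. */
typedef void (*feed_fn)(void *ctx);

/* Sleep for `seconds` in one-second slices, feeding the watchdog
   before each slice and once more at the end, so the hardware timer
   is never left unfed for longer than about one second.
   Returns the number of feeds performed (seconds + 1). */
static int sleep_and_feed(unsigned int seconds, feed_fn feed, void *ctx)
{
    int feeds = 0;

    for (unsigned int i = 0; i < seconds; i++) {
        feed(ctx);
        feeds++;
        sleep(1);   /* short slice, well below the hardware timeout */
    }

    feed(ctx);      /* final feed after the last slice */
    feeds++;
    return feeds;
}

/* Example callback for demonstration only: counts how often it is called. */
static void count_feed(void *ctx)
{
    int *counter = ctx;
    (*counter)++;
}
```

Calling sleep_and_feed(10, count_feed, &n) waits roughly ten seconds while invoking the callback eleven times, whereas a bare sleep(10) would leave the hardware timer unattended for the whole interval.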
The official 5.15 version at Sourceforge now has this option included. It can be configured in the file watchdog.conf using the line:
sigterm-delay = 5
(commented out in the example file). Please note the experimental 'V6' version should not be used any more as 5.15 has practically all of its features and several bug-fixes as well. Also note that the 'sat' web site might be shut down later in 2019 due to the withdrawal of NERC funding.