Thanks to the Intel TCO watchdog, some servers I manage now reboot automatically after a kernel or hardware crash, and the init scripts are even 'reboot-safe'. Sadly, this means I no longer get a notification from Nagios when a machine has crashed, because the service is back up before the checks have failed enough times to send one.
Is there a reliable script or Nagios check out there that will notify me if, say, a machine has crashed 3 times during the last 48 hours?
How about you write one? An easy way would be to run
uptime
in the script. A slightly better way would be to add an init script that echoes the boot time to a rotating log file. Grab the last three entries in the file and check how much time has elapsed since the oldest of them.
There are a number of "check_uptime" variants on Nagios Exchange. These let you catch quick reboots without setting max_check_attempts to 1 or 2 for the host check, which avoids the false positives such a low setting would cause.
This one, for example, can be run via NRPE (uses
uptime
), but can also check via SNMP (Linux, Windows, etc.).
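For the boot-log idea from the first reply, here is a minimal sketch of what such a check might look like. It assumes an init script appends one epoch timestamp per boot (for example with `date +%s`) to a log file. The path `/var/log/boot-times.log`, the function names, and the default thresholds (3 boots in 48 hours, taken from the question) are all hypothetical; only the exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
#!/usr/bin/env python3
"""Nagios-style check: alert if the host booted too often in a recent window."""
import sys
import time

# Hypothetical path; assumes an init script appends `date +%s` here on every boot.
BOOT_LOG = "/var/log/boot-times.log"

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def count_recent_boots(timestamps, now, window_seconds):
    """Count boot timestamps that fall within the last window_seconds."""
    return sum(1 for t in timestamps if 0 <= now - t <= window_seconds)


def check_reboots(timestamps, now, max_boots=3, window_hours=48):
    """Return (exit_code, message) for the given boot history."""
    recent = count_recent_boots(timestamps, now, window_hours * 3600)
    message = f"{recent} boot(s) in the last {window_hours}h"
    if recent >= max_boots:
        return CRITICAL, f"CRITICAL - {message}"
    return OK, f"OK - {message}"


def read_boot_log(path):
    """Parse one epoch timestamp per line, skipping blank or malformed lines."""
    stamps = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.isdigit():
                stamps.append(int(line))
    return stamps


def main():
    """Read the boot log, print a one-line status, and return the exit code."""
    try:
        stamps = read_boot_log(BOOT_LOG)
    except OSError as err:
        print(f"UNKNOWN - cannot read {BOOT_LOG}: {err}")
        return UNKNOWN
    code, message = check_reboots(stamps, time.time())
    print(message)
    return code
```

To use this as a plugin (run locally or via NRPE), end the script with `sys.exit(main())` so the exit code reaches Nagios. Because the check counts boots over the whole window rather than looking at current uptime, it still fires when the machine comes back up between two polls.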