We are running Nagios Core 4.0.2 with the latest NRPE on the clients.
We have 3 service definitions that each check the same piece of software once a minute, at different levels:
- Open TCP port check
- Process is running check
- Application-layer check that sends data to the socket and expects a particular return value
When any of these checks fails, we want an event_handler to restart the service, up to 3 times. If the state still isn't OK after the third restart, the problem should be escalated.
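For context, the setup described above would look roughly like the sketch below. The host name, service descriptions, port, NRPE command names, and the `restart-app` handler are placeholders rather than our real values, and `max_check_attempts 4` is an assumption about how "3 restarts, then escalate" maps onto SOFT and HARD states.

```cfg
# Sketch only. Hypothetical names: apphost01, restart-app, check_app_proc,
# check_app_reply, port 8080. "generic-service" is the sample template.
define service {
    use                     generic-service
    host_name               apphost01
    service_description     App TCP Port
    check_command           check_tcp!8080
    check_interval          1       ; minutes, with the default interval_length of 60
    retry_interval          1
    max_check_attempts      4       ; assumption: attempts 1-3 are SOFT, attempt 4 is HARD
    event_handler           restart-app
}

define service {
    use                     generic-service
    host_name               apphost01
    service_description     App Process
    check_command           check_nrpe!check_app_proc
    check_interval          1
    retry_interval          1
    max_check_attempts      4
    event_handler           restart-app
}

define service {
    use                     generic-service
    host_name               apphost01
    service_description     App Protocol
    check_command           check_nrpe!check_app_reply
    check_interval          1
    retry_interval          1
    max_check_attempts      4
    event_handler           restart-app
}

# The handler script receives the state, state type and attempt number,
# and decides from those whether to restart.
define command {
    command_name    restart-app
    command_line    $USER1$/eventhandlers/restart-app $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
```

With all three services pointing at the same handler, each service's state change invokes the script independently.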
The issue is that there are several combinations in which one failing check means another is expected to be CRITICAL as well. If each service has its own event_handler and two of them fail, the restart script is called twice.
- e.g. If the process is not running, the TCP port will not be open and the application-layer check will fail.
- e.g. The TCP port check could be CRITICAL because of a misconfigured firewall rule or a network condition, and the application-layer check will fail because the socket can't be reached, even though the process is still running.
Question: How can we ensure that the event handler is triggered by only one of the failed service checks, rather than once for each of the 3, which results in 2 or more restarts as their states change to CRITICAL? For example, if all 3 service checks go CRITICAL, that means 3 restarts in 1 minute and 6 restarts in 2 minutes (assuming the restarts fail to bring the services back to an OK state).
I believe service dependencies may be the right solution, but I'm not sure how to define them so that they cover the different conditions.
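To make the idea concrete, this is roughly the kind of definition I have in mind for the first case (process down). It is an untested sketch: the host name and service descriptions are placeholders, and I have not verified that skipping the dependent checks actually prevents their event handlers from firing.

```cfg
# Sketch only: the port and protocol checks depend on the process check.
# apphost01 and the service descriptions are hypothetical placeholders.
define servicedependency {
    host_name                       apphost01
    service_description             App Process
    dependent_host_name             apphost01
    dependent_service_description   App TCP Port
    execution_failure_criteria      c       ; skip the dependent check while the process check is CRITICAL
    notification_failure_criteria   c,u     ; suppress dependent notifications on CRITICAL or UNKNOWN
}

define servicedependency {
    host_name                       apphost01
    service_description             App Process
    dependent_host_name             apphost01
    dependent_service_description   App Protocol
    execution_failure_criteria      c
    notification_failure_criteria   c,u
}
```

This does not cover the second case, where the port check fails while the process is still running. As far as I understand it, dependencies are also evaluated against the last hard state unless `soft_state_dependencies` is enabled in the main configuration file, which may matter here because the restarts would happen during SOFT states.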