Running Nagios Core 4.0.2 and using latest NRPE on clients.
We have 3 service definitions to check one piece of software at different levels each minute:
- Open TCP port check
- Process is running check
- Application layer check by sending data to the socket and expecting some return value
If any of those checks enters a failed state, we call an event_handler to restart the service, up to 3 times. If the state still isn't OK after the third attempt, we escalate.
The issue is that there are a number of combinations where, if one service fails, another is also expected to be in a CRITICAL state. If we have an event_handler for each of these and two of them fail, the restart script would be called twice via event_handler.
- e.g. If the process is not running, then the TCP port will not be open and the application layer check will fail.
- e.g. The TCP port could be CRITICAL because of a misconfigured firewall rule or a network condition, and the application layer check will fail because the service can't be reached, but the process is still running.
Question: How can we ensure the event handler gets called by only one failed service check, and not once for each of the 3 services as their states change to CRITICAL, which would result in 2 or more restarts? For example, if all 3 service checks enter CRITICAL, that would be 3 restarts in 1 minute and 6 restarts in 2 minutes (assuming the restart failed to bring the services back to an OK state).
I believe service dependencies may be the right solution but I'm not sure how to go about creating them to satisfy the different conditions.
Service dependencies are the way to do it.
You want to make your Application layer service check dependent on your Process Running check.
You want to make your Process Running Check dependent on your TCP port check.
You want to make all of these dependent on a host (not service) check - this addresses the "network condition" failure scenario.
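These can get very complicated very quickly, but the basic idea is to create one servicedependency definition per pair of services, naming the master service (service_description), the dependent service (dependent_service_description) and the failure criteria.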
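execution_failure_criteria is really the workhorse here: it lists the states the master service (here, "The Service I Depend On") can be in for the dependent service not to be checked. Valid options are o (OK), w (WARNING), u (UNKNOWN), c (CRITICAL), p (PENDING) and n (none, i.e. the dependent check always runs); note that n means "none", not "notify". You can specify multiple options as a comma-separated list, e.g. w,u,c.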
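As a minimal sketch of the chain described above (Process Running depends on TCP port, Application layer depends on Process Running), the definitions might look something like this. The host name and service descriptions are hypothetical placeholders, not taken from the asker's setup, and the failure criteria are just one reasonable choice:

```
# Hypothetical names throughout; substitute your own host and service descriptions.

# Process check depends on the TCP port check.
define servicedependency {
    host_name                       app-server-01
    service_description             TCP Port Check            ; master service
    dependent_host_name             app-server-01
    dependent_service_description   Process Running Check     ; dependent service
    execution_failure_criteria      w,u,c                     ; skip active checks of the dependent in these master states
    notification_failure_criteria   w,u,c                     ; suppress its notifications in these master states
}

# Application layer check depends on the process check.
define servicedependency {
    host_name                       app-server-01
    service_description             Process Running Check     ; master service
    dependent_host_name             app-server-01
    dependent_service_description   Application Layer Check   ; dependent service
    execution_failure_criteria      w,u,c
    notification_failure_criteria   w,u,c
}
```

Because an execution dependency only stops the dependent service from being actively checked, the dependent keeps its last known state while the master is failing, so it should not go through a new state change of its own (and trigger its own event handler) during that time.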
Here's a good explanation of the Nagios configuration options: http://nagios.frank4dd.com/docs/en/objectdefinitions.html#servicedependency