Does anyone know if Nagios has an option to set a temporary check_interval on a service check and have it revert after X minutes?
My service check definition looks like this:
define service {
    host_name            prodhost
    use                  http
    service_description  www.example.com:8080
    check_command        check_http!8080!example.com:8080!/!5.000!10.00
    servicegroups        http-check
    check_interval       .5
    retry_interval       .25
    max_check_attempts   3
}
The problem is that each time changes are deployed to my web app (via CI), the application has to be restarted as part of the deployment process, which triggers some of my 5-second and 10-second warning and critical alerts.
I would like to keep my current check_interval, retry_interval, and max_check_attempts values intact, but be able to change them temporarily whenever a deployment is made and have them revert to their original values after 3 minutes.
What you're referring to is Adaptive Monitoring. It's not necessarily the best way to achieve your goal, but you can change these parameters with external commands. For example, you can submit them remotely via a script that connects using ssh with keys.
(Due to your sub-one-minute intervals, it will take some tweaking to get the timing right, as Nagios might not have processed the command yet before your next check is run.)
You would have your deployment workflow send a command to Nagios to modify the parameters, and then send another one to change them back later (after the service restarts). Alternatively, you could temporarily disable active checks or notifications in the same way.
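To make the two-command approach concrete, here is a rough sketch in Python. It relies on the standard external commands CHANGE_NORMAL_SVC_CHECK_INTERVAL and CHANGE_RETRY_SVC_CHECK_INTERVAL. The command file path, the relaxed values (5 and 1), and the helper names are my own assumptions, so check the command_file setting in your nagios.cfg and adjust to taste. I have not verified how every Nagios version handles fractional intervals sent this way.

```python
import time

# Assumed path; the real one is the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def external_command(name, *args, now=None):
    """Format one line in external command syntax: [epoch] NAME;arg1;arg2"""
    timestamp = int(time.time() if now is None else now)
    return "[{}] {}\n".format(timestamp, ";".join([name, *map(str, args)]))


def interval_commands(host, service, check_interval, retry_interval, now=None):
    """Build the commands that change one service's normal and retry intervals."""
    return [
        external_command("CHANGE_NORMAL_SVC_CHECK_INTERVAL",
                         host, service, check_interval, now=now),
        external_command("CHANGE_RETRY_SVC_CHECK_INTERVAL",
                         host, service, retry_interval, now=now),
    ]


def submit(lines, command_file=COMMAND_FILE):
    """Write the commands to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)


def relax_during_deploy(host, service, wait_seconds=180):
    """Hypothetical deploy hook: relax the intervals, wait, then restore them."""
    submit(interval_commands(host, service, 5, 1))        # illustrative relaxed values
    time.sleep(wait_seconds)                              # wait out the restart
    submit(interval_commands(host, service, 0.5, 0.25))   # values from the question
```

Your CI job would call something like relax_during_deploy("prodhost", "www.example.com:8080") on the Nagios host itself, for example over ssh with keys. The same external_command helper works for DISABLE_SVC_NOTIFICATIONS and ENABLE_SVC_NOTIFICATIONS if you would rather mute notifications than change the timing.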
What you should probably do instead is have the deployment process (automatically) put the service(s) in scheduled downtime. Downtime has the benefit of a built-in end time, so you don't have to submit a second command to revert your changes.
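A minimal sketch of the downtime approach, again in Python and again with assumptions: the helper name, the author "ci", and the comment text are made up for illustration. It formats a SCHEDULE_SVC_DOWNTIME external command for a fixed window starting now. Note that checks keep running during downtime; only the notifications are suppressed.

```python
import time


def downtime_command(host, service, minutes, author, comment, now=None):
    """Format a SCHEDULE_SVC_DOWNTIME command for a fixed window starting now."""
    start = int(time.time() if now is None else now)
    end = start + minutes * 60
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        start,        # start_time (epoch seconds)
        end,          # end_time (epoch seconds)
        1,            # fixed=1: downtime runs exactly from start to end
        0,            # trigger_id=0: not triggered by another downtime
        end - start,  # duration in seconds (only used for flexible downtime)
        author,
        comment,
    ]
    return "[{}] {}\n".format(start, ";".join(map(str, fields)))
```

The deployment step would write the returned line to the Nagios command file just before restarting the application, for example downtime_command("prodhost", "www.example.com:8080", 3, "ci", "deploy restart") for the 3-minute window from the question.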