When monitoring the health of a server, some faults or warnings are immediately urgent but others only matter if they persist. I'm thinking of things like:
- Some software needs to be updated
- Time offset differs from NTP
If unaddressed these could become real problems, but there are already background services in place to take care of them - unattended upgrades, an NTP client service etc. There's always a short delay between a problem arising and these background processes kicking in to address it, and our monitor sends out a series of emails in that gap - then another a minute later when the issue is fixed. I generally wake up to a large pile of "PROBLEM" emails, each with a corresponding "RESOLUTION" email sent a minute later. The danger is that in dismissing a hundred irrelevant warnings, I could miss the one that's real.
So is there any way of instructing Icinga or Nagios to only report an issue if it's continued for more than a certain time, say 5 minutes?
SvW is not wrong in what (s)he writes, but you should also investigate the variable max_check_attempts, which defines how many consecutive checks a service has to fail before going into a HARD error state and notifying. For some of my hair-trigger services, I have a short check interval, a 1-minute retry interval and max_check_attempts set to 2,
which means that NAGIOS will check more often than usual, and once it notices something's down, it'll wait 1 minute, check once more, then notify. For other services, where I don't care until it's been down a while, I have the retry interval set to the same 5 minutes as the normal check interval and a high max_check_attempts,
which means that once NAGIOS notices something's down, it'll carry on checking every 5 minutes as usual, and not tell me until it's been down for an hour.
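As a rough sketch of the two cases described above, in Nagios / Icinga 1 object syntax: the host name, service descriptions, check commands, the generic-service template and the exact numbers are assumptions chosen to match the description, not settings taken from a real setup. Intervals are in units of interval_length, which defaults to 60 seconds.

```
# Hair-trigger service: check often, retry once after 1 minute, then notify.
define service {
    use                     generic-service   ; assumed template supplying contacts, periods etc.
    host_name               web01             ; hypothetical host
    service_description     HTTP
    check_command           check_http        ; assumed command definition
    check_interval          2                 ; normal checks every 2 minutes (assumed value)
    retry_interval          1                 ; after a failure, re-check 1 minute later
    max_check_attempts      2                 ; second consecutive failure -> HARD state, notify
}

# Slow-burn service: keep the usual 5-minute rhythm, notify only after about an hour.
define service {
    use                     generic-service
    host_name               web01
    service_description     NTP offset
    check_command           check_ntp_time    ; assumed command definition
    check_interval          5
    retry_interval          5                 ; same rhythm while in a SOFT state
    max_check_attempts      12                ; 1 failing check + 11 retries at 5 minutes
}
```

The first failing check counts as attempt 1, so with the second definition the HARD state (and the notification) arrives roughly 55 minutes after the problem is first seen. For the "say 5 minutes" case in the question, something like retry_interval 1 with max_check_attempts 5 or 6 would be the equivalent.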
It is definitely worth tuning your NAGIOS until it tells you about the things you care about, at the time you care about them, and nothing else; a monitoring system that emits a cloud of false positives (i.e., sends you loads of notifications you don't really care about) is nearly as useless as one that has false negatives (i.e., fails to notice a real problem).
You can define detailed configurations to tell Nagios every detail about the check for a service.
Look up the check_interval and retry_interval config options, and while you are at it, learn about time periods in general.
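To illustrate how these options fit together with time periods, here is a minimal sketch in Nagios / Icinga 1 object syntax. The time period name, host name, check command and generic-service template are made up for illustration; adjust them to whatever your installation defines.

```
# A time period covering normal working hours (name and hours are assumptions).
define timeperiod {
    timeperiod_name     workhours
    alias               Normal working hours
    monday              09:00-17:00
    tuesday             09:00-17:00
    wednesday           09:00-17:00
    thursday            09:00-17:00
    friday              09:00-17:00
}

# A low-urgency service: checked around the clock, but only notified about in working hours.
define service {
    use                     generic-service   ; assumed template
    host_name               web01             ; hypothetical host
    service_description     Pending updates
    check_command           check_apt         ; assumed command definition
    check_interval          60                ; normal checks once an hour
    retry_interval          10                ; re-check every 10 minutes after a failure
    max_check_attempts      3
    notification_period     workhours         ; no emails outside this period
}
```

With a setup along these lines, a warning that clears itself overnight never reaches the inbox, while one that is still present when the working-hours period begins can still be reported.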