I have a few servers that are monitored by munin, and fairly frequently one of a selection of units has a transient failure to read data. That gets me two emails, one telling me that all the values are unknown and the second five minutes later letting me know that everything's OK after all.
As far as I can tell, munin is functioning as designed here, but I'd like to know if there's any way to delay sending the initial 'unknown' alert for one update cycle, so that transient unknowns aren't reported. All my current setup is achieving is training me to ignore the warning mails.
Failing that, is there any way to disable sending the 'unknown' alerts and their corresponding recovery alerts altogether?
I don't really use Munin, but as far as I can see there is an unknown_limit setting that can be set for items/plugins. It defines how many consecutive unsuccessful readings must occur before a value is set to "unknown". Based on the Munin::LimitsOld module it defaults to 3, so I think you should try setting or increasing this number.
I have checked this on Munin 1.4.5.
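To illustrate the idea, here is a minimal Python sketch of "only report unknown after N consecutive failed readings". This is not Munin's actual code (Munin is written in Perl), and the class and function names are hypothetical. It only models the counting behaviour described above, assuming a successful reading resets the counter and using the default of 3 mentioned for Munin::LimitsOld.

```python
from typing import List, Optional


class UnknownDebouncer:
    """Hypothetical model of an unknown_limit-style threshold (not Munin code)."""

    def __init__(self, unknown_limit: int = 3) -> None:
        if unknown_limit < 1:
            raise ValueError("unknown_limit must be at least 1")
        self.unknown_limit = unknown_limit
        self.consecutive_unknowns = 0
        self.state = "ok"

    def update(self, value: Optional[float]) -> str:
        """Feed one reading (None means the read failed); return the state."""
        if value is None:
            self.consecutive_unknowns += 1
            # Only flip to "unknown" once enough failures occur in a row.
            if self.consecutive_unknowns >= self.unknown_limit:
                self.state = "unknown"
        else:
            # Assumption: any successful reading resets the counter.
            self.consecutive_unknowns = 0
            self.state = "ok"
        return self.state


def states_for(readings: List[Optional[float]], unknown_limit: int = 3) -> List[str]:
    """Return the state after each reading, starting from a fresh debouncer."""
    debouncer = UnknownDebouncer(unknown_limit)
    return [debouncer.update(reading) for reading in readings]
```

With a limit of 2, a single failed read between two good ones never changes the state, so no alert (and no recovery mail) would be triggered for it. Two failures in a row would still be reported.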
I achieve this by having munin notify, not directly to end-users, but into NAGIOS via NSCA, and having NAGIOS handle the notifications. This means I can use NAGIOS' (much more sophisticated) controls on notification delay, frequency, escalations and so on. Yes, NAGIOS is quite heavyweight just to be a notification engine, but you can then use it for qualitative (rather than quantitative) monitoring as well.
Remember that one of the big benefits of using open source tools is that you can look at the source to see exactly what they do (and change the behaviour if you don't like it). A quick scan of LimitsOld.pm shows that Gábor's suggestion is the right approach - unknown_limit can be set on a per-service basis or globally and appeared around Munin 1.4.4 (see http://munin-monitoring.org/ticket/828).