I'd like to create a notification that alerts based on the availability of a group of services, instead of a threshold on a single service. For example, say I have 10 AWS servers that all do the same thing, and I expect some of them to be overloaded or to fail occasionally without hurting the application. I want Check_MK to notify me if 3 or more of the 10 servers fail a given service; if only one fails, don't notify me. Another, perhaps simpler, example: say you have an NFS mount point on 20 servers, all from the same NFS server. I don't want to get 20 warnings or criticals when I could get just one.
The above examples in my environment are already grouped in service groups.
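To make the desired rule concrete, here is a minimal Python sketch of the logic, independent of Check_MK. The function name `group_state`, the state constants, and the `crit_at` parameter are illustrative assumptions, not part of any Check_MK API.

```python
# Nagios-style state codes (0 = OK, 1 = WARN, 2 = CRIT).
OK, WARN, CRIT = 0, 1, 2


def group_state(states, crit_at=3):
    """Return CRIT once `crit_at` or more members are not OK, else OK.

    `states` is a list of per-server states for one service.
    Hypothetical helper for illustration; not a Check_MK function.
    """
    failed = sum(1 for s in states if s != OK)
    return CRIT if failed >= crit_at else OK


# Ten servers running the same service: one failure stays quiet,
# three failures raise a single group-level alert.
one_down = [CRIT] + [OK] * 9
three_down = [CRIT] * 3 + [OK] * 7
print(group_state(one_down))    # 0
print(group_state(three_down))  # 2
```

The point is that only the group-level result should trigger a notification; the individual per-server results stay visible but silent.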
I tried three different mechanisms in Check_MK 1.2.6p16:
1) Business Intelligence (BI): The grouping and warning settings were fantastic and did what I wanted! But the Notification Rules don't allow for anything related to the BI components of the product.
2) Cluster: I set up a cluster for the AWS servers, but since some of my checks (most notably my HTTP active check) require a hostname, that's not going to help. I don't think Cluster is the right rabbit hole to go down here, but correct me if I'm wrong. I abandoned looking at that.
3) Service Group Alert (the purpose of this original question): There's nothing in the Notifications logic that lets me alert on service group availability.
Has anyone accomplished this with Check_MK?
The NFS example will be tricky because cross-host service dependencies are not managed automatically. You'll need a workaround there. You can monitor the exporting and NFS services properly (there's an nfsexports check, and you can also try checking rpcinfo connects). That leaves a gap if, e.g., a firewall fails, but if you want to monitor NFS well, focus on the server.
1) BI doesn't alert directly. There's a check_bi_aggr active check that you'd need to build the alerts on, using the service names that it generates. The notification rules then need to be configured for that service. It should alert quickly once you hit the 3/10 mark.
The notifications for the individual services should then be toned down, e.g. by setting them not to notify for a long time via a notification delay.
2) A cluster is mostly useless for this; it will be content until the last node has failed.
3) This is basically a Nagios limitation; forget about this one.
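To illustrate point 2, here is a rough Python sketch of the "best-of" behaviour described there, where a clustered service only goes bad once every node has failed. The function `cluster_state` is a hypothetical stand-in for illustration, not Check_MK's actual cluster implementation.

```python
# Nagios-style state codes (0 = OK, 2 = CRIT).
OK, CRIT = 0, 2


def cluster_state(node_states):
    """Best-of aggregation: OK as long as at least one node is OK.

    Hypothetical helper mirroring the behaviour described in point 2.
    """
    return OK if any(s == OK for s in node_states) else CRIT


print(cluster_state([CRIT] * 3 + [OK] * 7))  # 0: 3 of 10 down, still OK
print(cluster_state([CRIT] * 9 + [OK]))      # 0: 9 of 10 down, still OK
print(cluster_state([CRIT] * 10))            # 2: only now does it alert
```

With this kind of aggregation the 3-out-of-10 case never surfaces, which is why the BI route in point 1 is the better fit.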