I've set up a Pacemaker/Corosync HA cluster in a failover configuration with two nodes: production and standby. There are three DRBD partitions. Everything works fine so far.
I'm using Nagios NRPE on both nodes to monitor the servers, with Icinga 2 as the reporting and visualization tool. Since the DRBD partitions on the standby node are not mounted until a failover occurs, I always get critical alerts for them.
These are false alarms. I've already stumbled upon DISABLE_SVC_CHECK and tried to implement it; here is an example:
echo "[`date +%s`] DISABLE_SVC_CHECK;$host_name;$service_name" >> "/var/run/icinga2/cmd/icinga2.cmd"
Isn't there an easy way or best practice to disable this check for DRBD on the standby node, in either Nagios or Icinga 2? Of course, I want the check to come back into effect for the standby after a failover.
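If I went down this route, I imagine something like the following wrapper, called from a failover hook (just a sketch: ENABLE_SVC_CHECK is the matching re-enable command, and the service name "drbd" is a placeholder for whatever the check is actually called):

#!/bin/sh
# Hypothetical helper: enable/disable the DRBD service check for a node
# via the Icinga 2 external command pipe (default path assumed).
# Usage: drbd_check_toggle <host_name> <enable|disable>
HOST="$1"
ACTION="$2"
PIPE="/var/run/icinga2/cmd/icinga2.cmd"

case "$ACTION" in
  disable) echo "[$(date +%s)] DISABLE_SVC_CHECK;$HOST;drbd" >> "$PIPE" ;;
  enable)  echo "[$(date +%s)] ENABLE_SVC_CHECK;$HOST;drbd"  >> "$PIPE" ;;
  *)       echo "usage: $0 <host_name> <enable|disable>" >&2; exit 2 ;;
esac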
I would advise not monitoring this on the host directly. In our environment we use Pacemaker to automate failovers. One of the things Pacemaker does for us is move an IP address on failover. This ensures our clients are always pointing at the primary and helps make failovers appear transparent from the client side.
For Nagios we monitor a slew of services on each host to keep an eye on things, but we also configure an additional "host" for the virtual/floating IP address to monitor the DRBD devices and the services that only run on the primary.
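As a minimal sketch of the idea (the floating IP 192.0.2.100 and the NRPE command name check_drbd are placeholders, not values from our setup):

# Per-node checks stay attached to the real hosts, but the DRBD check
# is attached to the floating-IP "host", so it always reaches whichever
# node currently holds the VIP, i.e. the primary.
/usr/lib/nagios/plugins/check_nrpe -H 192.0.2.100 -c check_drbd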
In my environment, we manage multiple services running on top of DRBD devices (traditional services, LXC containers, Docker containers, databases, ...). We use the OpenSVC stack (https://www.opensvc.com), which is free and open source and provides automatic failover features. Below is a test service with DRBD and a Redis application (disabled in the example).
First, at the cluster level, the svcmon output shows the state of the nodes and services. At the service level, svcmgr -s servdrbd print status shows the detailed status of each resource. To simulate an issue, I disconnected the DRBD device on the secondary node, which produced warnings in both outputs.
It is important to see that the service availability status is still up, but the overall service status is degraded to warn, meaning "production is still running fine, but something is wrong, have a look".
All OpenSVC commands accept a JSON output selector (nodemgr daemon status --format json or svcmgr -s servdrbd print status --format json), so it is easy to plug them into an NRPE script and simply monitor the service states. And as you saw, any issue on the primary or the secondary is trapped. nodemgr daemon status is the better choice because its output is the same on all cluster nodes, and all OpenSVC service information is displayed in a single command call. If you are interested in the service configuration file for this setup, I posted it on Pastebin here.
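As an illustration, a minimal NRPE plugin sketch along those lines. The jq path into the JSON is an assumption (the exact layout depends on the OpenSVC version), and servdrbd is just the example service name from above:

#!/bin/sh
# Hypothetical NRPE plugin: map the OpenSVC overall service status to
# Nagios exit codes. The jq filter is a guess at the daemon-status
# JSON layout; adjust it for your OpenSVC version.
SVC="servdrbd"
STATUS=$(nodemgr daemon status --format json | \
         jq -r --arg svc "$SVC" '.monitor.services[$svc].overall // "unknown"')

case "$STATUS" in
  up)   echo "OK - $SVC overall status is $STATUS";       exit 0 ;;
  warn) echo "WARNING - $SVC overall status is $STATUS";  exit 1 ;;
  *)    echo "CRITICAL - $SVC overall status is $STATUS"; exit 2 ;;
esac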
You could use check_multi to run both DRBD checks as a single Nagios check, and configure it to return OK if exactly one of the sub-checks is OK.
It gets tricky when you have to decide which host to attach the check to, though. You could attach it to a host defined on the VIP, or attach the check to both hosts and use NRPE/SSH on each to check the other, etc.
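A sketch of such a check_multi command file, assuming NRPE is reachable on both nodes and exposes a check_drbd command (host names, paths, and command names are placeholders):

# drbd_pair.cmd - hypothetical check_multi command file.
# Run the DRBD check on both nodes, then derive the parent state from
# how many sub-checks returned OK.
command [ drbd_node1 ] = /usr/lib/nagios/plugins/check_nrpe -H node1 -c check_drbd
command [ drbd_node2 ] = /usr/lib/nagios/plugins/check_nrpe -H node2 -c check_drbd

# Exactly one OK is the expected picture (primary mounted, standby not);
# anything else should raise an alert.
state [ OK       ] = COUNT(OK) == 1
state [ CRITICAL ] = COUNT(OK) != 1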