I've set up a Pacemaker/Corosync HA cluster in a failover configuration with two nodes: production and standby. There are three DRBD partitions. Everything works fine so far.
I'm using Nagios NRPE on both nodes to monitor the servers, with Icinga 2 as the reporting and visualization tool. Since the DRBD partitions on the standby node are not mounted until a failover occurs, I always get critical alerts for them.
These are false alarms. I've already stumbled upon DISABLE_SVC_CHECK and tried to implement it; here is an example:
echo "[`date +%s`] DISABLE_SVC_CHECK;$host_name;$service_name" >> "/var/run/icinga2/cmd/icinga2.cmd"
Isn't there an easy way or best practice to disable this check for DRBD on the standby node, in either Nagios or Icinga 2? Of course, I want the check to come back into effect for the standby after a failover.
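If I went down this route, I imagine something like the following wrapper, called from a failover hook (just a sketch: ENABLE_SVC_CHECK is the matching re-enable command, and the service name "drbd" is a placeholder for whatever the check is actually called):

#!/bin/sh
# Hypothetical helper: enable/disable the DRBD service check for a node
# via the Icinga 2 external command pipe (default path assumed).
# Usage: drbd_check_toggle <host_name> <enable|disable>
HOST="$1"
ACTION="$2"
PIPE="/var/run/icinga2/cmd/icinga2.cmd"

case "$ACTION" in
  disable) echo "[$(date +%s)] DISABLE_SVC_CHECK;$HOST;drbd" >> "$PIPE" ;;
  enable)  echo "[$(date +%s)] ENABLE_SVC_CHECK;$HOST;drbd"  >> "$PIPE" ;;
  *)       echo "usage: $0 <host_name> <enable|disable>" >&2; exit 2 ;;
esac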
I would advise not monitoring this on the host directly. In our environment we use Pacemaker to automate failovers. One of the things Pacemaker does for us is move an IP address on failover. This ensures our clients are always pointing at the primary and helps make failovers appear transparent from the client side.
For Nagios we monitor a slew of services on each host to keep an eye on things, but we also configure an additional "host" for the virtual/floating IP address to monitor the DRBD devices and the services that only run on the primary.
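As a minimal sketch of the idea (the floating IP 192.0.2.100 and the NRPE command name check_drbd are placeholders, not values from our setup):

# Per-node checks stay attached to the real hosts, but the DRBD check
# is attached to the floating-IP "host", so it always reaches whichever
# node currently holds the VIP, i.e. the primary.
/usr/lib/nagios/plugins/check_nrpe -H 192.0.2.100 -c check_drbd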
In my environment, we manage multiple services running on top of DRBD devices (traditional services, LXC containers, Docker containers, databases, ...). We use the OpenSVC stack (https://www.opensvc.com), which is free and open source and provides automatic failover features. Below is a test service with DRBD and a Redis application (disabled in the example).
First, at the cluster level, the svcmon output shows the state of the nodes and services. At the service level, svcmgr -s servdrbd print status shows the detailed status of each resource. To simulate an issue, I disconnected the DRBD device on the secondary node, which produced warnings in both outputs.
It is important to see that the service availability status is still up, but the overall service status is degraded to warn, meaning "production is still running fine, but something is wrong, have a look".
All OpenSVC commands accept a JSON output selector (nodemgr daemon status --format json or svcmgr -s servdrbd print status --format json), so it is easy to plug them into an NRPE script and simply monitor the service states. And as you saw, any issue on the primary or the secondary is trapped. nodemgr daemon status is the better choice because its output is the same on all cluster nodes, and all OpenSVC service information is displayed in a single command call. If you are interested in the service configuration file for this setup, I posted it on Pastebin here.
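As an illustration, a minimal NRPE plugin sketch along those lines. The jq path into the JSON is an assumption (the exact layout depends on the OpenSVC version), and servdrbd is just the example service name from above:

#!/bin/sh
# Hypothetical NRPE plugin: map the OpenSVC overall service status to
# Nagios exit codes. The jq filter is a guess at the daemon-status
# JSON layout; adjust it for your OpenSVC version.
SVC="servdrbd"
STATUS=$(nodemgr daemon status --format json | \
         jq -r --arg svc "$SVC" '.monitor.services[$svc].overall // "unknown"')

case "$STATUS" in
  up)   echo "OK - $SVC overall status is $STATUS";       exit 0 ;;
  warn) echo "WARNING - $SVC overall status is $STATUS";  exit 1 ;;
  *)    echo "CRITICAL - $SVC overall status is $STATUS"; exit 2 ;;
esac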
You could use check_multi to run both DRBD checks as a single Nagios check, and configure it to return OK if exactly one of the sub-checks is OK.
It gets tricky when you have to decide which host to attach the check to, though. You could attach it to a host defined on the VIP, or attach the check to both hosts and use NRPE/SSH on each to check the other, etc.
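A sketch of such a check_multi command file, assuming NRPE is reachable on both nodes and exposes a check_drbd command (host names, paths, and command names are placeholders):

# drbd_pair.cmd - hypothetical check_multi command file.
# Run the DRBD check on both nodes, then derive the parent state from
# how many sub-checks returned OK.
command [ drbd_node1 ] = /usr/lib/nagios/plugins/check_nrpe -H node1 -c check_drbd
command [ drbd_node2 ] = /usr/lib/nagios/plugins/check_nrpe -H node2 -c check_drbd

# Exactly one OK is the expected picture (primary mounted, standby not);
# anything else should raise an alert.
state [ OK       ] = COUNT(OK) == 1
state [ CRITICAL ] = COUNT(OK) != 1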