Part of my network estate has a fairly important dependency on a host whose availability is difficult to check. I have a number of hosts behind it, and my NAGIOS VPS provider occasionally has routing problems that cut off the provider where all these hosts are located. When it's unavailable I'd much prefer the hosts behind it to show UNAVAILABLE than DOWN, because they're not DOWN.
But its availability is difficult to detect, because it can't be PINGed:
[me@nagios systems]$ ping -c 1 -w 1 205.251.232.153
[...]
1 packets transmitted, 0 received, 100% packet loss, time 1000ms
and there seem to be no network services on it that respond to queries:
[me@nagios systems]$ nmap -P0 -sT 205.251.232.153
[...]
All 1000 scanned ports on 205.251.232.153 are filtered
It does, however, participate in and respond to traceroutes, which led me to discover that it will return ICMP-port-unreachable when I try to talk to a select range of UDP ports. This is the tcpdump output while I do echo foo|nc -u 205.251.232.197 33459:
[me@nagios systems]$ sudo tcpdump -n -n -i p1p1 host 205.251.232.197
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p1p1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:04:01.278269 IP a.b.c.d.36139 > 205.251.232.197.33459: UDP, length 4
15:04:01.448659 IP 205.251.232.197 > a.b.c.d: ICMP 205.251.232.197 udp port 33459 unreachable, length 36
So it seems to me that what I need is a test that emits a UDP packet to a host and port and regards ICMP-port-unreachable as evidence of success (hearing nothing constitutes failure). Does anyone know of a way to do this? How do others handle comparable monitoring problems?
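One way to build such a test is sketched below in Python. It relies on Linux behaviour: when a UDP socket has been connect()ed to a peer, an ICMP port-unreachable from that peer is reported to the socket as a "connection refused" error. The function name probe_udp_unreachable and the default timeout are made up for illustration; this is a minimal sketch under those assumptions, not an existing Nagios plugin.

```python
import socket
import sys

# Nagios plugin exit codes
OK, CRITICAL = 0, 2


def probe_udp_unreachable(host, port, timeout=2.0):
    """Send one UDP datagram; treat ICMP port-unreachable as proof of life.

    Hypothetical helper: returns (exit_code, message).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # connect() on a UDP socket sends nothing, but it makes the kernel
        # report ICMP errors from this peer as errors on the socket.
        sock.connect((host, port))
        try:
            sock.send(b"foo\n")
            sock.recv(1024)
        except ConnectionRefusedError:
            # ICMP port-unreachable came back: the host is alive.
            return OK, "ICMP port unreachable received"
        except socket.timeout:
            # Hearing nothing constitutes failure.
            return CRITICAL, "no response within %.1fs" % timeout
        except OSError as exc:
            # Any other socket error, e.g. no route to host.
            return CRITICAL, str(exc)
        # An actual UDP reply also shows that the host is reachable.
        return OK, "UDP reply received"


if __name__ == "__main__" and len(sys.argv) == 3:
    code, message = probe_udp_unreachable(sys.argv[1], int(sys.argv[2]))
    print(("OK" if code == OK else "CRITICAL") + " - " + message)
    sys.exit(code)
```

Saved under a name of your choosing and run with a host and port as arguments (for example 205.251.232.197 and 33459), it prints a one-line status and exits 0 or 2, which is the shape Nagios expects from a check command.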
No matter what protocol you use to check a host's availability, if there are routing issues to a host, it's going to appear as down. If you want to check a host's availability, and you don't want to enable ICMP, you could do a check_tcp or check_udp against any of the services you have running there. E.g. check_tcp -p 80 for HTTP or check_tcp -p 22 for ssh.
That said, it sounds like the larger problem you're trying to solve is avoiding alerts for the hosts behind the gateway when the gateway is unreachable. This can be solved by defining dependencies in Nagios: the hosts (or services) behind the gateway should depend on the gateway box. Then, if the gateway is down, Nagios won't alert you for the other hosts. (http://nagios.sourceforge.net/docs/3_0/dependencies.html)
I finally and belatedly realised that if I can traceroute through a host, I should also be able to traceroute to that host, and on testing, verified that this is indeed the case.
All the traceroute-related plugins I could find on places like NAGIOS exchange are more sophisticated than this; they want to verify things like the identity of the first or second hop in the chain, and so on. All I want is a plugin that verifies that I can traceroute to a host and get a response. I found a plugin that (roughly) did that, and hacked it into shape for use with Linux (specifically, CentOS 6); it appears below in case it is of use to anyone.
This host has since become unavailable several times, and my NAGIOS has done the right thing: all the hosts on the far side have alerted as UNAVAILABLE instead of DOWN.
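For anyone wanting to build something similar, here is a minimal Python sketch of the same idea. It is not the plugin referred to above. It assumes Python 3.7 or later, a Linux traceroute binary on the PATH that accepts -n, -q, -w and -m, and a target given as an IP address. The names reached_target and check_traceroute are invented for illustration.

```python
import subprocess
import sys

# Nagios plugin exit codes
OK, CRITICAL, UNKNOWN = 0, 2, 3


def reached_target(traceroute_output, target_ip):
    """True if the last hop line of numeric traceroute output names target_ip."""
    rows = [ln.split() for ln in traceroute_output.splitlines() if ln.strip()]
    # Hop lines start with the hop number; the "traceroute to ..." header does not.
    hops = [fields for fields in rows if fields[0].isdigit()]
    return bool(hops) and target_ip in hops[-1][1:]


def check_traceroute(target_ip, max_hops=30):
    """Run traceroute and map the result to a Nagios (exit_code, message)."""
    # -n numeric output, -q 1 one probe per hop, -w 1 wait 1 s, -m hop limit
    cmd = ["traceroute", "-n", "-q", "1", "-w", "1", "-m", str(max_hops), target_ip]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, "could not run traceroute: %s" % exc
    if reached_target(result.stdout, target_ip):
        return OK, "%s answered traceroute" % target_ip
    return CRITICAL, "traceroute never reached %s" % target_ip


if __name__ == "__main__" and len(sys.argv) == 2:
    code, message = check_traceroute(sys.argv[1])
    print(message)
    sys.exit(code)
```

Keeping the output parsing in its own function means it can be checked against saved traceroute output without touching the network. A real deployment would probably want a lower hop limit so that the check finishes within Nagios's plugin timeout.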