We have a problem with our Nagios check_icmp monitors. Our network suffers from microbursts that can drop 1 or 2 milliseconds of traffic through our firewall. We are working on the firewall issue itself, but in the meantime the microbursts are triggering false host-down alarms from Nagios:
Sun Jul 14 00:00:37 CDT 2013 [1373778037] HOST ALERT: host1;DOWN;SOFT;1;CRITICAL - 105.195.240.6: rta nan, lost 100%
Sun Jul 14 00:00:37 CDT 2013 [1373778037] HOST ALERT: host2;DOWN;SOFT;1;CRITICAL - 105.195.115.33: rta nan, lost 100%
Sun Jul 14 00:00:37 CDT 2013 [1373778037] HOST ALERT: host3;DOWN;SOFT;1;CRITICAL - 105.193.26.8: rta nan, lost 100%
Sun Jul 14 00:00:37 CDT 2013 [1373778037] HOST ALERT: host4;DOWN;SOFT;1;CRITICAL - 105.193.221.73: rta nan, lost 100%
Sun Jul 14 00:00:37 CDT 2013 [1373778037] HOST ALERT: host5;DOWN;SOFT;1;CRITICAL - 105.194.18.91: rta nan, lost 100%
The reason is that check_icmp uses absurdly small inter-packet spacing defaults: the spacing is so low that the entire ping cycle can fit inside a single microburst through the firewall. This is what we see when we run check_icmp -n 5 -t 3 -v 10.19.26.29:
[mpenning@target1 ~]$ sudo tshark -i eth0 icmp and host nagios.domain.local
[sudo] password for mpenning:
Running as user "root" and group "root". This could be dangerous.
Capturing on eth0
0.000000 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000028 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.000348 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000358 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.000572 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000581 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.000792 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000801 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.001017 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.001025 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
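To put numbers on that spacing, here is a rough Python sketch that computes the gaps between consecutive echo requests from the capture above. The helper name request_gaps_us is made up for illustration, and the parsing assumes the tshark line layout shown here (timestamp first, "request" last).

```python
from decimal import Decimal

# Lines copied from the first tshark capture above.
CAPTURE = """\
0.000000 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000028 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.000348 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000358 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.000572 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000581 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.000792 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.000801 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
0.001017 10.19.20.16 -> 10.19.26.29 ICMP Echo (ping) request
0.001025 10.19.26.29 -> 10.19.20.16 ICMP Echo (ping) reply
"""


def request_gaps_us(capture):
    """Return gaps in microseconds between consecutive echo requests."""
    # Keep only request lines; the first field is the relative timestamp in seconds.
    times = [
        Decimal(line.split()[0])
        for line in capture.splitlines()
        if line.rstrip().endswith("request")
    ]
    # Decimal keeps the subtraction exact, so int() does not truncate a float error.
    return [int((later - earlier) * 1_000_000) for earlier, later in zip(times, times[1:])]


gaps = request_gaps_us(CAPTURE)
span_us = sum(gaps)
print(gaps)     # [348, 224, 220, 225]
print(span_us)  # 1017
```

All five requests leave within about 1 ms, so the whole cycle fits inside a 1 to 2 ms drop window.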
While check_icmp has a -i switch that allegedly controls inter-packet spacing, it doesn't seem to allow 500ms packet spacing. Even when I run it as check_icmp -n 5 -t 3 -i 2000 -v 10.19.26.29, the timing doesn't substantially change:
[mpenning@target1 ~]$ sudo tshark -i eth0 icmp and host nagios.domain.local
Running as user "root" and group "root". This could be dangerous.
Capturing on eth0
0.000000 10.19.20.16 -> 105.19.26.29 ICMP Echo (ping) request
0.000018 10.19.26.29 -> 105.19.20.16 ICMP Echo (ping) reply
0.000327 10.19.20.16 -> 105.19.26.29 ICMP Echo (ping) request
0.000338 10.19.26.29 -> 105.19.20.16 ICMP Echo (ping) reply
0.000540 10.19.20.16 -> 105.19.26.29 ICMP Echo (ping) request
0.000552 10.19.26.29 -> 105.19.20.16 ICMP Echo (ping) reply
0.000813 10.19.20.16 -> 105.19.26.29 ICMP Echo (ping) request
0.000824 10.19.26.29 -> 105.19.20.16 ICMP Echo (ping) reply
0.001075 10.19.20.16 -> 105.19.26.29 ICMP Echo (ping) request
0.001087 10.19.26.29 -> 105.19.20.16 ICMP Echo (ping) reply
Is there a way to force Nagios' check_icmp or check_ping to increase the packet spacing to 500ms between pings? I realize I could ask Nagios to send 5000 pings per host, but that seems like a real waste of system and network resources just to work around this problem.
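check_icmp offers several command-line options that may help. Run check_icmp -h from the command line for the full list.
From my understanding, the value passed to -i is interpreted in microseconds, so -i 2000 asks for only 2 ms of spacing and 500 ms would need -i 500000: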
-i max packet interval (currently 80.000ms)
-i 2000 (2.000ms)
-i 80000 (80.000ms)
-i 500000 (500.000ms)
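As a small sketch of that conversion, the Python below builds the command line for a desired spacing in milliseconds. It assumes, per the reading above, that -i takes microseconds; confirm this against check_icmp -h on your build before relying on it. The helpers interval_arg and build_command are hypothetical names, not part of Nagios.

```python
def interval_arg(milliseconds):
    """Convert milliseconds to the -i value, ASSUMING -i is in microseconds."""
    return str(int(milliseconds * 1000))


def build_command(host, interval_ms, packets=5, timeout_s=3):
    """Assemble a check_icmp argument list using the flags from the question."""
    return [
        "check_icmp",
        "-n", str(packets),
        "-t", str(timeout_s),
        "-i", interval_arg(interval_ms),
        "-v", host,
    ]


command = build_command("10.19.26.29", 500)
print(" ".join(command))  # check_icmp -n 5 -t 3 -i 500000 -v 10.19.26.29
```

With 5 packets at 500 ms spacing, the requests span about 2 seconds (4 gaps), which still fits inside the 3 second -t timeout used in the question.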