I have a local NTP server running on the subnet to keep the other nodes in sync, so that not every node has to sync with upstream servers. While implementing the check_ntp_time plugin for Nagios, however, I have run into a frustrating issue: Nagios keeps reporting a critical error for the local nodes that sync against the local NTP server.
Here is the NTP config on the local NTP server. Note the upstream server entries and the restrict entry for the local subnet; according to my research, this is what qualifies the node as an NTP server that local nodes can sync against.
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod limited nomodify notrap nopeer noquery
restrict -6 default kod limited nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1
# Makes me able to answer requests from local nodes
restrict 10.0.0.0 mask 255.255.192.0 nomodify notrap
# My source
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org
logfile /var/log/ntp/server.log
statistics loopstats
statsdir /var/log/ntp/
filegen peerstats file peers type day link enable
filegen loopstats file loops type day link enable
On the other local nodes (the ones that are not NTP servers), everything is the same except that the restrict entry for the local subnet is removed and the only server entry references the local NTP server: server ntp.example.com iburst. Every local node can resolve ntp.example.com.
The problem I am having is when I run the following command from the nagios server:
/usr/lib64/nagios/plugins/check_ntp_time -H node-a-1 -v
And the output:
sending request to peer 0
response from peer 0: offset -0.002921819687
sending request to peer 0
response from peer 0: offset -0.0001939535141
sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
discarding peer 0: stratum=0
overall average offset: 0
NTP CRITICAL: Offset unknown|
This happens for every node except the local NTP server, which references the upstream servers. At first I thought it was an iptables issue, but I have UDP port 123 opened on every local node (to allow Nagios to check the time difference):
ACCEPT udp -- eth0 * 10.0.0.0/18 0.0.0.0/0 multiport dports 123 /* 777 allow ntp access */ state NEW
Versions:
nagios-plugins-ntp: 1.4.16
ntp: 4.2.6p5-1.el6.centos
Any help is greatly appreciated. I really can't submit the Nagios work until I get this resolved; as you know, keeping server times in sync is priority 1.
-- Edit --
Per the comments, here are the results of ntpq -p on various nodes:
# Actual NTP Server (10.0.0.2)
==============================================================================
+propjet.latt.ne 241.199.164.101 2 u 105 128 337 14.578 12.954 7.138
+x2la01.hostigat 63.145.169.2 3 u 21 128 377 16.037 13.546 4.090
*pacific.latt.ne 241.199.164.101 2 u 72 128 377 15.148 24.434 7.403
# Local node 1
==============================================================================
*service-a-1.sn1 204.2.134.163 3 u 9 128 377 0.228 5.217 1.296
# Local node 2
==============================================================================
*service-a-1.sn1 204.2.134.163 3 u 91 128 377 0.200 3.608 1.167
The key line here is this one:
discarding peer 0: stratum=0
An NTP server identifying itself as stratum 0 is a violation of the spec (stratum 0 denotes the reference clocks themselves, such as atomic or GPS clocks, not a server on the network). I had this problem years ago with some BSD and Mac OS X hosts. I ended up hacking the stratum check out of the source and maintaining a separate build of the plugin for "problematic" hosts.
The offending lines are 254-257 (currently, anyway), if you want to rip that out. It's a hack, but it works for me ;-)
I found this thread in the mailing list archives about it. I think there was another one where I suggested adding a command-line option to ignore the stratum check, but I don't think it got any traction.
There's also a bug report about it, but it hasn't yielded anything useful as far as I can tell.
I removed the problem by disabling the KOD (kiss-o'-death) feature on the NTP server.
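For the stock restrict lines shown in the question, disabling KOD might look like the fragment below. This is a sketch under assumptions, not a tested configuration: it assumes the change is made on each ntpd instance that check_ntp_time queries, and 10.0.0.10 is a hypothetical address for the Nagios server. Note that the subnet restrict line on the local NTP server carries neither kod nor limited, which may be why that host already passes the check.

```conf
# Option 1: drop rate limiting and KOD from the default restrictions.
restrict default nomodify notrap nopeer noquery
restrict -6 default nomodify notrap nopeer noquery

# Option 2: keep "kod limited" in the defaults, but add a more specific
# entry without those flags for the monitoring host.
# 10.0.0.10 is a hypothetical Nagios server address.
restrict 10.0.0.10 nomodify notrap nopeer noquery
```

Option 2 keeps rate limiting in place for every other client, at the cost of one extra line per monitoring host. ntpd has to be restarted for either change to take effect.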
check_ntp sends (at least) four requests in quick succession to calculate a statistically sound average offset. The server treats the third and all following requests as a denial-of-service attack and answers them with a KOD message (which carries an invalid stratum, namely 0). Arguably this is a bug in check_ntp, since a client must process KOD properly.
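To illustrate what "processing KOD properly" could mean on the client side, here is a minimal Python sketch. It relies on the NTPv4 packet layout from RFC 5905: a kiss-o'-death reply has stratum 0 and carries a four-character ASCII kiss code (for example RATE) in the reference ID field. The function name classify_reply and the sample packets are hypothetical, built by hand for illustration; this is not how check_ntp_time is implemented.

```python
import struct

NTP_HEADER_LEN = 48  # fixed-size NTP header, without extension fields


def classify_reply(packet: bytes):
    """Return ("kod", kiss_code) or ("time", stratum) for a raw NTP reply."""
    if len(packet) < NTP_HEADER_LEN:
        raise ValueError("NTP packet shorter than 48 bytes")
    # Byte 0 packs leap indicator, version and mode; byte 1 is the stratum.
    _li_vn_mode, stratum = struct.unpack("!BB", packet[:2])
    if stratum == 0:
        # For stratum 0 the reference ID (bytes 12-15) holds an ASCII kiss code.
        kiss_code = packet[12:16].decode("ascii", errors="replace").rstrip("\x00")
        return ("kod", kiss_code)
    return ("time", stratum)


if __name__ == "__main__":
    # Hand-built sample packets (hypothetical, not captured traffic).
    kod = bytes([0xE4, 0, 3, 0xEC]) + b"\x00" * 8 + b"RATE" + b"\x00" * 32
    normal = bytes([0x24, 3, 6, 0xEC]) + b"\x00" * 8 + bytes([10, 0, 0, 2]) + b"\x00" * 32
    print(classify_reply(kod))     # ('kod', 'RATE')
    print(classify_reply(normal))  # ('time', 3)
```

A client that separates the two cases like this could report "rate limited by server" and back off, rather than discarding the peer and ending up with "Offset unknown".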