I've got about 70 Linux instances running on an OpenStack cluster that currently consists of two compute nodes and one controller. These machines live in a Rackspace DC as part of their 'Private Cloud' program, so all of our resources are dedicated.
Previously we were using only Rackspace's NTP servers to synchronize the clocks on all of our instances, but Check_MK was frequently notifying us that the instances were syncing to themselves [stratum 10], implying that the NTP servers were not responding. Given that only 4 of our 70+ instances had public IP addresses, I assumed that Rackspace's NTP servers were rate-limiting us, since they would be seeing 35+ times the normal rate of NTP queries originating from our two compute hosts. This seemed logical since the 4 instances with public IPs never generated any complaints about NTP.
To address this I changed ntp.conf on our instances to include our controller node alongside the Rackspace servers, so we would at least have a fallback when the Rackspace servers stopped responding [the NTP cookbook we are using does not allow us to set a preference]. However, this has not stopped, or even reduced, the number of NTP complaints. I've been seeing last entries in ntpq -p in excess of 60 minutes for all three hosts. I can't see how IP-based rate limiting might be coming into effect with the controller node, since the instances and the controller reside on, and communicate through, a private network where every instance has its own IP address.
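For reference when reading those ages: ntpq prints the time since the last received packet in seconds, and switches to an m, h or d suffix once the value gets large (the instance output further down shows both forms, e.g. 1997 and 51h). A minimal Python sketch of that conversion, assuming the usual ntpq -p column format; the helper names when_to_seconds and missed_polls are made up for illustration and are not part of any NTP tool:

```python
import re

# Suffixes ntpq appends to the "when" column once the value gets large.
# A bare number is seconds.
UNIT_SECONDS = {"": 1, "m": 60, "h": 3600, "d": 86400}


def when_to_seconds(when):
    """Convert an ntpq 'when' field such as '1997', '51h' or '84d' to seconds.

    Returns None for '-', which ntpq prints when nothing has been received yet.
    """
    if when == "-":
        return None
    match = re.fullmatch(r"(\d+)([mhd]?)", when)
    if match is None:
        raise ValueError(f"unrecognised when field: {when!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def missed_polls(when, poll):
    """Whole poll intervals that have elapsed since the last reply (rough estimate)."""
    seconds = when_to_seconds(when)
    if seconds is None:
        return None
    return seconds // poll
```

With a poll interval of 1024 seconds, a when value of 1997 means roughly one full interval has passed without a reply, while anything below 1024 is within the normal polling cycle.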
What could be causing this? As far as I've been able to tell there is nothing in the restrict default line that would cause what we're experiencing.
ntp.conf from an instance:
driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
leapfile /etc/ntp.leapseconds
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
server controller01.dfw.domain.com iburst
restrict controller01.dfw.domain.com nomodify notrap noquery
server time.dfw1.rackspace.com iburst
restrict time.dfw1.rackspace.com nomodify notrap noquery
server time2.dfw1.rackspace.com iburst
restrict time2.dfw1.rackspace.com nomodify notrap noquery
restrict default kod notrap nomodify nopeer noquery
restrict 127.0.0.1 nomodify
restrict -6 default kod notrap nomodify nopeer noquery
restrict -6 ::1 nomodify
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10
ntp.conf from the controller node:
driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
leapfile /etc/ntp.leapseconds
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
server 0.pool.ntp.org iburst
restrict 0.pool.ntp.org nomodify notrap noquery
server 1.pool.ntp.org iburst
restrict 1.pool.ntp.org nomodify notrap noquery
server 2.pool.ntp.org iburst
restrict 2.pool.ntp.org nomodify notrap noquery
server 3.pool.ntp.org iburst
restrict 3.pool.ntp.org nomodify notrap noquery
restrict default kod notrap nomodify nopeer noquery
restrict 127.0.0.1 nomodify
restrict -6 default kod notrap nomodify nopeer noquery
restrict -6 ::1 nomodify
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10
- Controller node OS is Ubuntu 12.04.3 LTS running ntpd 4.2.6p3
- Instance OSes are CentOS 6.4/6.5 running ntpd 4.2.4p8/4.2.6p5
Edit:
Controller:
# ntpq -npcrv
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+66.79.167.34    129.6.15.28      2 u  933 1024  377   50.360    3.898   5.064
-208.53.158.34   164.244.221.197  2 u  372 1024  377   27.384    6.635   5.323
+173.230.158.30  199.102.46.73    2 u  780 1024  357   47.656    0.897   0.596
*129.250.35.251  209.51.161.238   2 u  373 1024  377   40.828    1.786   0.163
 127.127.1.0     .LOCL.          10 l  84d   64    0    0.000    0.000   0.000
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd [email protected] Tue Jun 5 20:12:08 UTC 2012 (1)",
processor="x86_64", system="Linux/3.2.0-54-generic", leap=00, stratum=3,
precision=-22, rootdelay=48.228, rootdisp=69.214, refid=129.250.35.251,
reftime=d6f049cf.5ce03f06 Wed, Apr 9 2014 22:35:59.362,
clock=d6f04f81.183edd61 Wed, Apr 9 2014 23:00:17.094, peer=21729,
tc=10, mintc=3, offset=1.514, frequency=12.879, sys_jitter=1.158,
clk_jitter=0.896, clk_wander=0.058
Instance:
$ ntpq -npcrv
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.240.0.81     129.250.35.251   3 u 1997 1024  376    0.461   -2.098   0.194
+72.3.128.240    204.9.54.119     2 u 1556 1024  376    0.677    2.234   4.023
*72.3.128.241    204.9.54.119     2 u 1664 1024  376    0.793   -0.783   0.836
 127.127.1.0     .LOCL.          10 l  51h   64    0    0.000    0.000   0.000
associd=0 status=06ff leap_none, sync_ntp, 15 events, stale_leapsecond_values,
version="ntpd [email protected] Sat Nov 23 18:21:48 UTC 2013 (1)",
processor="x86_64", system="Linux/2.6.32-431.5.1.el6.x86_64", leap=00,
stratum=3, precision=-22, rootdelay=30.593, rootdisp=105.114,
refid=72.3.128.241,
reftime=d6f04951.9026bd89 Wed, Apr 9 2014 22:33:53.563,
clock=d6f04fd1.0d15b2be Wed, Apr 9 2014 23:01:37.051, peer=54008,
tc=10, mintc=3, offset=-0.295, frequency=-0.163, sys_jitter=1.914,
clk_jitter=0.918, clk_wander=0.080, tai=35, leapsec=201207010000,
expire=201306280000
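
A note on reading the reach column in the two listings above: it is an octal rendering of an 8-bit shift register in which each bit records whether one of the last eight polls got a reply, with the newest poll in the lowest bit. A small Python sketch of the decoding (the function name decode_reach is hypothetical, chosen only for this illustration):

```python
def decode_reach(reach):
    """Expand ntpq's octal 'reach' value into the last eight poll results.

    Returns a list of eight booleans ordered oldest to newest;
    True means a reply was received for that poll.
    """
    value = int(str(reach), 8)  # ntpq prints the register in octal
    if not 0 <= value <= 0o377:
        raise ValueError(f"reach out of range for an 8-bit register: {reach!r}")
    # Bit 7 is the oldest poll, bit 0 the most recent one.
    return [bool((value >> bit) & 1) for bit in range(7, -1, -1)]
```

Decoded this way, 377 means all of the last eight polls were answered, 376 means only the most recent poll went unanswered, and 357 means a single miss four polls before the most recent one.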