I have a rather simple setup where I have two computers:
Computer A. has a normal NTP setup and uses the standard Internet sources (As per Ubuntu) to determine time. It also allows for query on IP 10.0.2.0/24
:
restrict 10.0.2.0 mask 255.255.255.0 nomodify notrap
Computer B. has a normal NTP setup, except for all the sources are changed to use 10.0.2.1
(which is Computer A).
Once in a while, Computer A gets a Kiss-of-Death signal from one of its source. As a result, it totally kills Computer B's NTP (i.e. it looks like the KoD is transmitted directly).
Is there a way to know the state of an NTP server in terms of whether it will just be sending KoD message or not? (also, how do I get out of that situation? When I looked at it, all the IP addresses shown in the log were not used by the server?! so I don't understand why it insists on sending KoD to its client).
I've found two things so far:
ntpq
I can run
ntpq
like so:ntpq -pn
When the NTP server works, I can see an asterisk in front of the IP address the computer is happy about. In my case, all of the status flags (first column
+
,-
,*
,#
, etc.) all disappear. As far as I know, that means the NTP service is not happy and no synchronization is being performed.Here is an example when it still works (i.e. there are flags in the very first column):
remote refid st t when poll reach delay offset jitter ============================================================================== 10.0.2.255 .BCST. 16 B - 64 0 0.000 0.000 0.000 #51.77.203.211 134.59.1.5 3 u 4 64 1 171.248 -743.64 691.917 +72.5.72.15 216.218.254.202 2 u 2 64 1 19.223 -778.34 686.200 +159.69.25.180 192.53.103.103 2 u 3 64 1 237.733 -775.41 701.376 +173.0.48.220 43.77.130.254 2 u 2 64 1 35.489 -778.85 669.187 38.229.56.9 172.16.21.35 2 u 31 64 1 153.976 -268.90 122.557 +137.190.2.4 .PPS. 1 u 31 64 1 93.797 -253.69 116.289 +150.136.0.232 185.125.206.71 3 u 35 64 1 95.667 -178.19 114.912 94.154.96.7 194.29.130.252 2 u 31 64 1 237.560 -231.88 107.230 +162.159.200.123 10.4.1.175 3 u 34 64 1 16.246 -199.68 115.561 *216.218.254.202 .CDMA. 1 u 35 64 1 52.906 -193.84 131.148 91.189.91.157 132.163.96.1 2 u 45 64 1 87.772 -5.716 0.000 +204.2.134.163 44.24.199.34 3 u 34 64 1 16.711 -199.12 116.777 +74.6.168.73 208.71.46.33 2 u 35 64 1 69.772 -189.21 128.119 91.189.89.199 17.253.34.123 2 u 45 64 1 165.471 -3.708 0.000 +216.229.0.49 216.218.192.202 2 u 35 64 1 71.437 -178.94 97.505 91.189.89.198 17.253.34.123 2 u 44 64 1 172.852 -17.899 0.000
ntpdate -q <ip>
The
ntpdate
command will actually tell me whether the NTP is accepting packets. This is because it gives an error message if not:$ sudo ntpdate -q 10.0.2.1 server 10.0.2.1, stratum 4, offset 5.194725, delay 0.02652 21 Jul 15:22:48 ntpdate[13086]: no server suitable for synchronization found
This happens after a little while when my main server loses the
*
status on the one server it was first happy to synchronize with...
Now... I still need to understand what I have to do to fix this issue...
This may be helpful, here are the logs about a restart from a full reboot:
Jul 21 18:29:13 vm-ve-ctl kernel: [ 434.275481] audit: type=1400 audit(1626917353.636:43): apparmor="DENIED" operation="open" profile="/usr/sbin/ntp
d" name="/snap/bin/" pid=3896 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Jul 21 18:29:13 vm-ve-ctl ntpd[3896]: ntpd [email protected] (1): Starting
Jul 21 18:29:13 vm-ve-ctl ntpd[3896]: Command line: /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 126:129
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: proto: precision = 0.190 usec (-22)
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Cannot open logfile /var/log/ntp.log: Permission denied
Jul 21 18:29:13 vm-ve-ctl kernel: [ 434.291490] audit: type=1400 audit(1626917353.652:44): apparmor="DENIED" operation="capable" profile="/usr/sbin/
ntpd" pid=3901 comm="ntpd" capability=1 capname="dac_override"
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): good hash signature
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): loaded, expire=2021-12-28T00:00:00Z last=2017-01-01T
00:00:00Z ofs=37
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen and drop on 0 v6wildcard [::]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 2 lo 127.0.0.1:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 3 enp0s3 192.168.2.120:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 4 enp0s8 10.0.2.1:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 5 lo [::1]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 6 enp0s3 [fe80::a00:27ff:fe25:38ff%2]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 7 enp0s8 [fe80::a00:27ff:fe35:c30b%3]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listening on routing socket on fd #24 for interface updates
Jul 21 18:29:14 vm-ve-ctl ntpd[3901]: Soliciting pool server 51.77.203.211
Jul 21 18:29:15 vm-ve-ctl ntpd[3901]: Soliciting pool server 159.69.25.180
Jul 21 18:29:15 vm-ve-ctl ntpd[3901]: Soliciting pool server 72.5.72.15
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 198.251.86.68
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 173.0.48.220
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 38.229.56.9
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 150.136.0.232
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 94.154.96.7
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 137.190.2.4
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 162.159.200.123
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 216.218.254.202
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.91.157
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.89.199
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 74.6.168.73
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 204.2.134.163
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.89.198
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 216.229.0.49
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 2604:ed40:1000:1711:d862:f5ff:fe4e:41c4
Jul 21 18:29:21 vm-ve-ctl ntpd[3901]: receive: Unexpected origin timestamp 0xe4a34871.ac57f05d does not match aorg 0000000000.00000000 from [email protected] xmt 0xe4a34871.65648c54
I do not know exactly when it starts going bad. I've also seen the following which I thought could have something to do with it (i.e. when that happens, the corresponding IP is removed from the list!), but it is already bad now and no such error occurred in my last run.
Jul 21 18:08:57 vm-ve-ctl ntpd[9764]: 92.243.6.5 local addr 192.168.2.120 -> <null>
Note: the 192.168.2.120 is the IP of the failing computer. It is a VirtualBox. It has been working for months... though, maybe something changed which makes it unhappy.
I found this post about an issue with the ... -> <null>
message. However, I think we have a newer version on Ubuntu 18.04:
SUSE minimum recommended version: ntp-4.2.8p7-11.1
Ubuntu 18.04 version: 1:4.2.8p10+dfsg-5ubuntu7.3
Just in case, I tried to connect the VM to the host and I still get a huge offset and jitter. What could have changed?!
remote refid st t when poll reach delay offset jitter
==============================================================================
10.0.2.10 .POOL. 16 p - 64 0 0.000 0.000 0.000
10.0.2.10 132.163.97.6 2 u 54 64 3 0.457 -5254.2 3917.68
As asked by Paul Gear, here is the output of ntpq with additional details:
$ ntpq -ncrv
associd=0 status=0028 leap_none, sync_unspec, 2 events, no_sys_peer,
version="ntpd [email protected] (1)", processor="x86_64",
system="Linux/4.15.0-151-generic", leap=00, stratum=4, precision=-23,
rootdelay=17.930, rootdisp=5019.260, refid=173.255.215.209,
reftime=e4a44f7a.1c2ec778 Thu, Jul 22 2021 13:11:38.110,
clock=e4a45030.c8a4b259 Thu, Jul 22 2021 13:14:40.783, peer=0, tc=6,
mintc=3, offset=-109.527915, frequency=-1.707, sys_jitter=0.000000,
clk_jitter=38.724, clk_wander=0.000, tai=37, leapsec=201701010000,
expire=202112280000
Here is the list of available clocks and the one currently used:
$ grep . /sys/devices/system/clocksource/clocksource*/[ac]*clocksource
/sys/devices/system/clocksource/clocksource0/available_clocksource:kvm-clock tsc acpi_pm
/sys/devices/system/clocksource/clocksource0/current_clocksource:kvm-clock
And finally, the dmesg
output about the clocksource selection process:
$ dmesg | grep clocksource
[ 0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.283117] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 1.161844] clocksource: Switched to clocksource kvm-clock
[ 1.208316] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 2.329228] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1db81a3240f, max_idle_ns: 440795250379 ns
It looks like I found a solution. I'm not too sure why it would have been working before, though.
After many searches, I found this VirtualBox ticket:
https://www.virtualbox.org/ticket/15179
which says that you shouldn't use
ntpd
because the VM already does keep time using the Host time to adjust the VM's virtual clock.However, even withouth
ntpd
running, my VM's clock was off and would bounce between ±30 seconds in a short period of time.Reading that post further, they offered to adjust the paravirtualization settings to "None". That didn't seem to work for me. I restarted the VM and it got stuck. So I instead tried with "Minimal" and now the clock is working. The
ntpq -np
output looks a lot better:As we can see, the Offset (max. 4.3) and Jitter (max. 6.7) are minuscule compare to what I was getting before. Now my other computer also works and can synchronize itself to this clock. The delay is around 0.7, which is enough for me (in my case, enough is 12.0 or less).
IMPORTANT NOTE: I rebooted that VM 2 or 3 times using Minimal, the boot time is excruciatingly long. So I do not recommend using this mode except as a the very last resort if you absolutely need a working system clock. If you have a better solution that works, I'm all ears!
Just in case, I wanted to see the difference in regard to the clock sources in Minimal mode. We just get the
acpi_pm
clock.Now I'm wondering whether this could have an effect on the host clock. Since I also have ntp on my host, I don't think this is an issue.
Okay, rather unusual, I'm adding a second answer because this 100% explains why the started breaking. Within a few days after I last rebooted, there is an NVidia driver upgrade. Then I started my VM (i.e. the order is very important here!)
I didn't know that the 3D environment could be missing and if not accelerated, then the VM would totally misbehave in terms of clock/time.
Note that the fact that the 3D environment is not available was not visible to me. It may have been mentioned in some logs, but as a standard user, I missed that part completely.
After a full reboot (required by that NVidia upgrade), the VM works normally again. No need to use the Minimal option. I put that VM back to Default and it works nicely all the way as before.
I hope this saves a few people the headache I had for 3 days...
For those interested, changing the clock may also help. Oracle has a nice page about how to change the clock source. On my end, I ended up using the
apci_pm
which seems to help greatly in maintaining time for a long time.You can also update your boot parameters so that way each time you boot you get that selected source.
(I removed irrelevant parameters here, don't delete them, just add the
clocksource
parameter; it may also be that the variable is empty the first time you edit it:GRUB_CMDLINE_LINUX=""
).