We have occasionally seen time differences between our servers, and confirmed that:
- ntpd crashed without any traceable logs
- the ntpd process was dead, but its pid file still existed at /var/run/ntpd.pid
- after /etc/init.d/ntp restart followed by ntpq -p, the problem was solved
At first, ntpq -p returned ntpq: read: Connection refused, so I ran ps aux | grep ntp, which showed no ntp process, while other working hosts showed something like /usr/sbin/ntpd -p /var/run/ntpd.pid -u 101:103 -g. It seems that ntpd actually crashed, since I found no related entries in /var/log/messages, but it's possible that the crash happened long enough ago that the relevant part of the log had already been rotated out.
So I ran /etc/init.d/ntp restart and was told that a stale pid existed:
Stopping NTP server: ntpdstart-stop-daemon: warning: failed to kill 2124: No such process
Starting NTP server: ntpd.
After that, everything was back to normal.
We're on Debian 6 Squeeze, but the problem has been around since Debian 5 Lenny. We installed ntp using aptitude install ntp. The servers are Linode VPSes (Xen virtualization), so we asked Linode, but they said they had not seen anything like this.
Also, though I don't know whether it's just a coincidence, so far it seems to happen only on application servers (Ruby on Rails).
The thing is, since the pid file remains when ntpd crashes, it's pretty hard to detect the crash and restart with monit or the like. Should I call /etc/init.d/ntp restart every once in a while from cron?
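Rather than restarting blindly from cron, one option is to restart only when the pid file is stale. The following is a minimal sketch, not a tested production script: the function name pidfile_is_stale and the --run switch are made up for illustration, the paths are the ones mentioned above, and it assumes the job runs as root so that kill -0 can see the ntpd process. It also cannot tell whether the pid has been reused by an unrelated process.

```shell
#!/bin/sh
# Sketch: restart ntp only when its pid file points at a dead process.
# PIDFILE and the init script path come from the post; everything else
# (function name, --run switch) is hypothetical.

PIDFILE="${PIDFILE:-/var/run/ntpd.pid}"

# Returns 0 (true) when the pid file exists but no process has that pid.
pidfile_is_stale() {
    file="$1"
    [ -f "$file" ] || return 1          # no pid file: nothing to call stale
    pid=$(cat "$file" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;        # empty or non-numeric: treat as stale
    esac
    # kill -0 sends no signal; it only reports whether the pid exists.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Act only when called with --run, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ] && pidfile_is_stale "$PIDFILE"; then
    /etc/init.d/ntp restart
fi
```

Saved under a path of your choosing and called with --run from root's crontab every few minutes, this would leave a healthy ntpd alone and only trigger the restart in the stale-pid situation described above.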
Any experiences, solutions, thoughts?
If you're using monit, their FAQ says that monit checks to make sure that the pid in the pid file is valid in order to detect situations where the program crashes and leaves its pid file behind.
If you're not using monit, then perhaps you can find a monitoring script that communicates with ntpd directly (Nagios has several NTP plugins that you might be able to use or adapt). If the script can't communicate with ntpd, then it has probably crashed.
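As a rough illustration of the "talk to ntpd directly" idea, here is a small sketch that asks ntpq for the peer list and treats a missing peer table as a failure. It checks the output rather than relying on the exit status, since the question shows ntpq printing "Connection refused" when ntpd is down. The function name ntpd_responds, the NTPQ override variable, and the --run switch are assumptions for illustration, as is the idea that a healthy reply always contains the "remote" column header (a running ntpd with no peers configured would be reported as down).

```shell
#!/bin/sh
# Sketch: consider ntpd alive only if ntpq can fetch a peer table from it.
# NTPQ can be overridden (e.g. for testing); names here are hypothetical.

NTPQ="${NTPQ:-ntpq}"

# Returns 0 when ntpq prints a peer table, non-zero otherwise.
ntpd_responds() {
    # -p lists peers; -n skips DNS lookups so the check returns quickly.
    out=$($NTPQ -pn 2>&1) || return 1
    # A normal peer listing starts with a header containing "remote".
    printf '%s\n' "$out" | grep -q 'remote'
}

# Act only when called with --run, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ] && ! ntpd_responds; then
    /etc/init.d/ntp restart
fi
```

Unlike a pid check, this also catches the case where the process exists but no longer answers queries.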