What is a good strategy for detecting time drift in an all-Linux data centre? This is a more difficult problem than it seems at first.
Time drift can cause serious problems for certain applications, and even when NTP is installed it can still fail for the following (and many more) reasons:
- NTP was not correctly set up to automatically restart on reboot.
- The settings on a server are incorrect so the time server it points to is unreachable or inaccurate.
- The master time server is unreachable, so all the servers that sync with it are now syncing to an unreliable source.
I would like a way to detect if all the individual servers are correct. Bear in mind that the server with the testing script/application may not be right.
This is easy to control. Configuration management is the key...
Ensure that the ntp service is running and configured...
For example, using Monit to make sure ntpd is running, and to restart it if it fails, is an easy approach... It may make sense to add cron and other essential daemons to that sort of check.
Another option is using a configuration management tool like Puppet to push the same ntp.conf to your servers and ensure that ntpd is installed, configured and running.
There is enough redundancy in the NTP protocol to cope with a time server being unreachable, provided you specify multiple sources.
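To illustrate why multiple sources help, here is a minimal Python sketch. It is not ntpd's actual selection algorithm, just the underlying idea: given offsets measured against several sources, take the median as the consensus and flag any source that disagrees with it by more than a tolerance. The function name, the sample source names and the 0.5 second tolerance are all assumptions made for illustration.

```python
from statistics import median


def find_outliers(offsets, tolerance=0.5):
    """Return (consensus, outliers) for a mapping of source -> offset in seconds.

    The consensus is the median offset. A source is an outlier when its
    offset differs from the consensus by more than `tolerance` seconds.
    """
    if len(offsets) < 3:
        # With fewer than three sources a single bad one cannot be outvoted.
        raise ValueError("need at least three sources")
    consensus = median(offsets.values())
    outliers = {
        name: value
        for name, value in offsets.items()
        if abs(value - consensus) > tolerance
    }
    return consensus, outliers


# Hypothetical measurements: three sources agree, one is far off.
sample = {"ntp-a": 0.012, "ntp-b": -0.008, "ntp-c": 0.020, "ntp-d": 37.4}
consensus, outliers = find_outliers(sample)
print("consensus offset:", consensus)
print("suspect sources:", outliers)
```

The same comparison works in the other direction: if the machine running the test is the one that is wrong, every source will appear to be off by roughly the same amount, which points at the tester rather than the sources.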
There are a variety of check_ntp plugins for Nagios out there.
Here's one:
http://nagiosplugins.org/man/check_ntp
Add this check to your Nagios host and get alerts if anything goes awry.
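For a feel of what a check like this does under the hood, here is a rough Python sketch. It is not the plugin's source code. It uses the standard NTP offset formula on the four timestamps of a request/response exchange and maps the result to the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function names and the 0.5 s / 1.0 s thresholds are assumptions chosen for the example, not the plugin's defaults.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def clock_offset(t1, t2, t3, t4):
    """Estimate the local clock's offset from the server, in seconds.

    t1: client send time      (client clock)
    t2: server receive time   (server clock)
    t3: server send time      (server clock)
    t4: client receive time   (client clock)
    A positive result means the local clock is behind the server.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def nagios_status(offset, warn=0.5, crit=1.0):
    """Map an offset to an exit code; the thresholds are illustrative."""
    magnitude = abs(offset)
    if magnitude >= crit:
        return CRITICAL
    if magnitude >= warn:
        return WARNING
    return OK


# Hypothetical exchange: the local clock is about two seconds behind.
offset = clock_offset(100.000, 102.010, 102.011, 100.021)
status = nagios_status(offset)
print("offset:", round(offset, 3), "status:", status)
```

A real check would obtain the four timestamps from an actual NTP exchange and exit with the status code so that Nagios can raise the alert.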