*NOTE: if your server still has issues due to confused kernels, and you can't reboot - the simplest solution proposed with gnu date installed on your system is: date -s now. This will reset the kernel's internal "time_was_set" variable and fix the CPU hogging futex loops in java and other userspace tools. I have straced this command on my own system an confirmed it's doing what it says on the tin *
POSTMORTEM
Anticlimax: only thing that died was my VPN (openvpn) link to the cluster, so there was an exciting few seconds while it re-established. Everything else was fine, and starting up ntp went cleanly after the leap second had passed.
I have written up my full experience of the day at http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/
If you look at Marco's blog at http://my.opera.com/marcomarongiu/blog/2012/06/01/an-humble-attempt-to-work-around-the-leap-second - he has a solution for phasing the time change over 24 hours using ntpd -x to avoid the 1 second skip. This is an alternative smearing method to running your own ntp infrastructure.
Just today, Sat June 30th, 2012 - starting soon after the start of the day GMT. We've had a handful of servers in different datacentres as managed by different teams all go dark - not responding to pings, screen blank.
They're all running Debian Squeeze - with everything from stock kernel to custom 3.2.21 builds. Most are Dell M610 blades, but I've also just lost a Dell R510 and other departments have lost machines from other vendors too. There was also an older IBM x3550 which crashed and which I thought might be unrelated, but now I'm wondering.
The one crash which I did get a screen dump from said:
[3161000.864001] BUG: spinlock lockup on CPU#1, ntpd/3358
[3161000.864001] lock: ffff88083fc0d740, .magic: dead4ead, .owner: imapd/24737, .owner_cpu: 0
Unfortunately the blades all supposedly had kdump configured, but they died so hard that kdump didn't trigger - and they had console blanking turned on. I've disabled console blanking now, so fingers crossed I'll have more information after the next crash.
Just want to know if it's a common thread or "just us". It's really odd that they're different units in different datacentres bought at different times and run by different admins (I run the FastMail.FM ones)... and now even different vendor hardware. Most of the machines which crashed had been up for weeks/months and were running 3.1 or 3.2 series kernels.
The most recent crash was a machine which had only been up about 6 hours running 3.2.21.
THE WORKAROUND
Ok people, here's how I worked around it.
- disabled ntp:
/etc/init.d/ntp stop
- created http://linux.brong.fastmail.fm/2012-06-30/fixtime.pl (code stolen from Marco, see blog posts in comments)
- ran
fixtime.pl
without an argument to see that there was a leap second set - ran
fixtime.pl
with an argument to remove the leap second
NOTE: depends on adjtimex
. I've put a copy of the squeeze adjtimex
binary at http://linux.brong.fastmail.fm/2012-06-30/adjtimex — it will run without dependencies on a squeeze 64 bit system. If you put it in the same directory as fixtime.pl
, it will be used if the system one isn't present. Obviously if you don't have squeeze 64-bit… find your own.
I'm going to start ntp
again tomorrow.
As an anonymous user suggested - an alternative to running adjtimex
is to just set the time yourself, which will presumably also clear the leapsecond counter.