I had a server lock up this morning. Here is a screen shot from the console:
None of the messages in the screenshot mean anything to me, and I have a feeling that the important stuff scrolled off the console. I cannot find any of the messages from the screen capture in syslog, messages, dmesg, the debug logs, or anywhere else around the time of the crash. Shouldn't this stuff have been logged?
This is a Debian box running Proxmox. uname output:
2.6.32-4-pve #1 SMP Mon May 9 12:59:57 CEST 2011 x86_64 GNU/Linux
The server has been online for about a year with no other crashes and it booted up again just fine.
I would love to figure out what the issue might have been so that we can prevent it from occurring again in the future. But, from the evidence I have so far, I don't even know if this was a hardware or software issue. Ideas?
Exactly which Debian kernel release do you run? You can see the full version and revision numbers if you do "dpkg -l | grep linux-image".
It looks like you're hitting a fairly prevalent bug that I've seen strike numerous times. In mainline kernels before 3.2, stable kernels before 2.6.32.50, and Debian kernels before 2.6.32-45 (which is based on 2.6.32.50 stable), there's a clock overflow that strikes after ~208 days of uptime, and once it has happened the machine can crash. I don't know exactly what triggers the crash after that point; the patch itself is pretty vague about it too.
I've seen upwards of a hundred crashes due to this issue before the cause was determined and the patch was deployed.
The bug was discussed at length in the lkml at the end of 2011. There could be a possible link to this divide by zero bug, but I haven't found any conclusion.
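For what it's worth, the ~208 day figure is what you get if a 64-bit nanosecond calculation loses 10 bits to a fixed-point scale factor, so that it wraps at 2^54 ns. The 10-bit shift is my assumption about where the overflow comes from, not something stated in the patch description; the sketch below only checks the arithmetic.

```python
# Assumption: the overflow happens in a 64-bit multiplication that uses a
# 2**10 fixed-point scale factor, so the nanosecond value wraps at 2**54.
SCALE_SHIFT = 10                      # assumed fixed-point shift
WRAP_NS = 2 ** (64 - SCALE_SHIFT)     # nanoseconds until the product wraps

NS_PER_DAY = 24 * 60 * 60 * 10**9


def overflow_days(wrap_ns=WRAP_NS):
    """Convert the wrap point from nanoseconds to days of uptime."""
    return wrap_ns / NS_PER_DAY


print(f"{overflow_days():.1f} days")  # prints "208.5 days"
```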
TL;DR: The likely fix is to upgrade to Debian's linux-image version 2.6.32-45 or later.
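If you have other boxes on an unpatched kernel, a rough way to see whether they are already past the ~208 day mark is to read /proc/uptime. This is only a sketch: the helper names are made up for illustration, and the 208-day threshold is the approximate figure from above, not an exact value.

```python
THRESHOLD_DAYS = 208  # approximate figure quoted above, not exact


def uptime_days(proc_uptime_text):
    """Parse the contents of /proc/uptime; the first field is uptime in seconds."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400


def past_threshold(days, threshold=THRESHOLD_DAYS):
    """True if the machine has been up at least as long as the threshold."""
    return days >= threshold


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as f:
            days = uptime_days(f.read())
    except OSError:
        print("could not read /proc/uptime")
    else:
        state = "past" if past_threshold(days) else "below"
        print(f"up {days:.1f} days, {state} the ~{THRESHOLD_DAYS} day mark")
```

Note that a box that has just rebooted, like yours, will show as below the mark again; the check is mainly useful for machines that have not crashed yet.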
This is a screenshot of a kernel panic. The traceback is printed inside out, so whatever function finally killed the kernel is off the top of the screen, but starting from the top is a call to divide_error() in hpet_msi_next_event(). divide_error() is defined in the kernel as a trap for FPE_INTDIV, so something in hpet_msi_next_event() attempted to divide by zero.
Unfortunately, the cause of that could be either hardware, software, or even just a transient bit flip error. (Are you using ECC RAM?)