During the past month, one of my Debian Squeeze (Linux 2.6.32-bpo.5-amd64) machines locked up hard twice. No response to ARP, dark console, Caps Lock and Num Lock not working, Magic SysRq ineffective. Switching to the 3.2.0-0.bpo.2-amd64 kernel from backports didn't help either.
Temperature and load monitoring don't show any spikes before the crashes.
How should I diagnose and debug such a problem?
Is netconsole my only bet?
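For reference, the netconsole setup I have in mind would look roughly like the sketch below. The IP addresses, ports, interface name, and MAC address are placeholders I made up for illustration, and `netconsole_param` is just a hypothetical helper that assembles the module parameter in the `src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac` form. It only prints the `modprobe` command instead of running it.

```shell
#!/bin/sh
# Build the netconsole module parameter string:
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# Arguments: $1 source port, $2 source IP, $3 network device,
#            $4 target port, $5 target IP, $6 target MAC address
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values (assumptions for illustration, not my real network).
PARAM=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)

# Dry run: print the command rather than actually loading the module.
echo "modprobe netconsole netconsole=$PARAM"
```

The receiving machine would need a UDP listener on the target port (for example netcat, whose exact flags differ between variants), and raising the console log level with `dmesg -n 8` on the crashing machine should let more messages through.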
EDIT: I've already disabled screen blanking:
#/etc/console-tools/config
BLANK_TIME=0
POWERDOWN_TIME=0
and
setterm -blank 0
on the physical console.
UPDATE:
This time when it locked up, the screen was still showing the login prompt. Since the last incident I've run a 6-hour load test with BOINC (Prime 95) without any problems.
I've found two possible solutions; I'll report whether they work. EDIT: They didn't.
The first is enabling nmi_watchdog by adding
nmi_watchdog=1
to the kernel boot parameters.
The second one (thanks @womble for the suggestion) was forcing ECC on by
Unfortunately, the 2.6.32-bpo.5-amd64 (Debian squeeze) kernel lacks support for ECC DDR3 memory, so I had to use 3.2 from backports.
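To check after a reboot that a boot parameter such as nmi_watchdog really reached the kernel, something like the following sketch could be used. `has_param` is a hypothetical helper name of mine; it only looks for the parameter as a whole word in the command line read from /proc/cmdline and says nothing about whether the watchdog itself works.

```shell
#!/bin/sh
# has_param CMDLINE NAME
# Succeeds if NAME or NAME=value appears as a whole word in CMDLINE.
has_param() {
    for word in $1; do
        case "$word" in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

# Kernel command line of the running system (empty if unreadable).
CMDLINE=$(cat /proc/cmdline 2>/dev/null)

if has_param "$CMDLINE" nmi_watchdog; then
    echo "nmi_watchdog is on the kernel command line"
else
    echo "nmi_watchdog is NOT on the kernel command line"
fi

# Show the NMI line, if any; counters that keep growing between runs
# suggest the watchdog is ticking.
grep NMI /proc/interrupts 2>/dev/null || true
```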
I also added these options to the general kernel parameters:
As the hangs were happening more and more often, the problem was probably caused by a faulty mainboard or, less likely, the CPU. After replacing those components, the problems went away.