A development server I'm responsible for (ext3 on RAID 5, Debian Squeeze) froze over the weekend and I was forced to reset it. It was completely unresponsive: nothing from the KVM or the physical keyboard, no eth devices responding, etc. Not even the backup process ran (figures, the one time I don't check for confirmation).
After the reset, it turns out that every trace of disk I/O that should have happened over a period of ~24 hours is completely gone. The log files have a big gap in their dates and times, as if the writes were never committed to disk, and no processes seem to have run.
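To pin down exactly when logging stopped and resumed, something like the following could help. It is only a minimal sketch, assuming traditional syslog lines that start with a `Mon DD HH:MM:SS` timestamp (which carries no year, so one has to be supplied); the function names `parse_syslog_time` and `find_gaps` are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Returns None for lines that do not start with such a stamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (last_seen, next_seen) pairs separated by at least `threshold`."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and stamp - previous >= threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Feeding it the lines of a file such as `/var/log/syslog` (the path is an assumption about the setup) gives the last timestamp before the freeze and the first one after the reset, which can then be compared across several log files to see whether they all stop at the same moment.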
Luckily it was a weekend, so nothing of value was lost, and I don't suspect a hack.
What can I do in a post-mortem of this event to prevent it from ever happening again? I've seen this happen before on a completely different machine running FreeBSD.
I am rounding up the disk checking tools right now, but there must be more going on!
- Mount options:
/dev/sda1 on / type ext3 (rw,errors=remount-ro)
- Kernel:
Linux dev 2.6.32-5-686-bigmem
- Disk/Inodes:
13%/3%
Sounds familiar to me. Do you have an Intel CPU? If so, what are the green-mode (power-saving) settings in your BIOS? Is your BIOS up to date?
Which Intel microcode patch does your Debian apply during boot?
I had similar situations where an R310 froze up (on weekends, at times when nothing was happening). This was fixed by an Intel microcode update (CentOS 5 in my case).
Dell recommended a BIOS upgrade, which in turn applied the same microcode update.
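To see which microcode revision is actually loaded, a small parser along these lines may be enough. This is a rough sketch that assumes the kernel reports a `microcode` field in `/proc/cpuinfo`; as far as I know that field only appears on kernels newer than the 2.6.32 mentioned in the question, where the revision is instead usually readable from `/sys/devices/system/cpu/cpu0/microcode/version` or the boot messages. The function name `microcode_revisions` is hypothetical.

```python
def microcode_revisions(cpuinfo_text):
    """Map each processor index to its microcode revision string.

    `cpuinfo_text` is the content of /proc/cpuinfo. Processors without a
    'microcode' field are simply left out of the result.
    """
    revisions = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "microcode" and current is not None:
            revisions[current] = value
    return revisions
```

Calling it as `microcode_revisions(open("/proc/cpuinfo").read())` before and after a BIOS or microcode package update shows whether the revision really changed.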
In other cases I have seen the Intel C-states (CPU sleep states) to be responsible.
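To find out which C-states the kernel is using, the cpuidle entries in sysfs can be listed. The sketch below assumes the kernel exposes `/sys/devices/system/cpu/cpu0/cpuidle/stateN/name`, which depends on the cpuidle driver being active; `list_idle_states` is just an illustrative name.

```python
import os


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(directory, state name), ...] for one CPU's idle states.

    Returns an empty list when the directory does not exist, for example
    when no cpuidle driver is loaded.
    """
    states = []
    if not os.path.isdir(cpuidle_dir):
        return states
    # Sort by length first so that state10 comes after state2.
    for entry in sorted(os.listdir(cpuidle_dir), key=lambda e: (len(e), e)):
        name_file = os.path.join(cpuidle_dir, entry, "name")
        if entry.startswith("state") and os.path.isfile(name_file):
            with open(name_file) as handle:
                states.append((entry, handle.read().strip()))
    return states
```

If deep states show up here, limiting them in the BIOS for a while is one way to test whether they are involved in the freezes.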
If you don't have an oops message from the kernel showing why it locked up, you aren't going to be able to troubleshoot much further. You might be able to set up kdump to save some debug output should it happen again, and you could run memtest86 or some other hardware diagnostics, but without further information you can't move forward.
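kdump needs memory reserved for the capture kernel through the `crashkernel=` boot parameter, so a quick sanity check before relying on it is to look for that parameter on the running kernel's command line. A minimal sketch, with the helper name `crashkernel_setting` invented for this example:

```python
def crashkernel_setting(cmdline_text):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline_text.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None
```

Running `crashkernel_setting(open("/proc/cmdline").read())` and getting `None` back would suggest that no memory is reserved and a crash dump could not be captured on the next lockup.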