
Question

thinice

Asked: 2011-06-21 11:44:08 +0800 CST

Server hang - data loss on reboot, post mortem analysis

772

A development server I'm responsible for (ext3 on RAID 5, Debian Squeeze) froze over the weekend and I was forced to reset it. It was completely unresponsive: nothing from KVM or the physical keyboard, no eth devices responding, and so on. Not even the backup process ran (figures, the one time I don't check for confirmation).

After the reset, it turns out that every trace of activity that should have happened over a period of about 24 hours is completely gone. The log files have a big gap in their dates and times, as if the writes were never committed to disk, and no processes seem to have run.

Luckily it was a weekend, so nothing of value would have been lost, and I don't suspect a hack.

What can I do in a post mortem of this event to prevent it from ever happening again? I've seen this happen before on a completely different machine running FreeBSD.

I am rounding up the disk checking tools right now, but there must be more going on!

Mount options: /dev/sda1 on / type ext3 (rw,errors=remount-ro)
Kernel: Linux dev 2.6.32-5-686-bigmem
Disk/Inodes: 13%/3%

2 Answers


Nils · Answer 1 · 2011-06-24T13:05:25+08:00

Best Answer


Sounds familiar to me. Do you have an Intel CPU? If so, what are the power-saving ("green mode") settings in your BIOS? Is your BIOS up to date?

Which Intel microcode patch does your Debian apply during boot?
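One way to answer the microcode question is sketched below, assuming a kernel that reports a `microcode` field in `/proc/cpuinfo`. Newer kernels do; the 2.6.32 kernel from the question may not, in which case the boot messages are the place to look. The function name `microcode_revisions` is made up for this illustration.

```python
import re

def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revision strings found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # On kernels that report it, the line looks like "microcode\t: 0x2f".
        match = re.match(r"microcode\s*:\s*(\S+)", line)
        if match:
            revisions.add(match.group(1))
    return revisions
```

Calling it with the contents of `/proc/cpuinfo` before and after a BIOS or microcode package update should show whether the revision actually changed. An empty set means the kernel does not report the field.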

I had similar situations where an R310 froze up (on weekends, at times when nothing was happening). This was fixed by an Intel microcode update (CentOS 5 in my case).

Dell recommended a BIOS upgrade, which in turn applied the same microcode update.

In other cases I have seen Intel C-states (CPU sleep states) to be responsible.
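To see which C-states a machine actually enters, the kernel's cpuidle statistics in sysfs can be read. This is a minimal sketch, assuming the kernel exposes `cpuidle/state*` directories under `/sys/devices/system/cpu/cpu0`. The function name `read_cstates` is hypothetical.

```python
import glob
import os

def read_cstates(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """List (name, usage count, total time in microseconds) per idle state."""
    pattern = os.path.join(cpu_dir, "cpuidle", "state*")
    # Sort numerically so that state10 does not land before state2.
    state_dirs = sorted(glob.glob(pattern),
                        key=lambda p: int(p.rsplit("state", 1)[1]))
    states = []
    for state_dir in state_dirs:
        def read(field):
            with open(os.path.join(state_dir, field)) as f:
                return f.read().strip()
        states.append((read("name"), int(read("usage")), int(read("time"))))
    return states
```

A machine that spends most of an idle weekend in its deepest state fits the pattern described above. If deep states are suspected, limiting them in the BIOS or with the `processor.max_cstate` kernel parameter is a common way to test the theory.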

1

mtinberg · Answer 2 · 2011-06-25T14:56:37+08:00


If you don't have an oops message from the kernel explaining why it locked up, you won't be able to troubleshoot much further. You could set up kdump to save some debug output should it happen again, and you could run memtest86 or other hardware diagnostics, but without further information you can't move forward.
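One thing that can still be recovered is the exact window of the hang, since the question mentions a gap in the log timestamps. The sketch below finds the longest silence in a log, assuming the traditional syslog timestamp prefix (for example `Jun 18 03:12:45`). That format carries no year, so the year is passed in as a parameter. The name `largest_gap` is made up for this example.

```python
from datetime import datetime

def largest_gap(lines, year=2011):
    """Return (gap, last_before, first_after) for the longest silence, or None."""
    previous = None
    best = None
    for line in lines:
        try:
            # The first 15 characters hold the syslog timestamp, without a year.
            stamp = datetime.strptime("%d %s" % (year, line[:15]),
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if previous is not None and (best is None or stamp - previous > best[0]):
            best = (stamp - previous, previous, stamp)
        previous = stamp
    return best
```

Running it over something like `open("/var/log/syslog")` gives the last entry written before the freeze and the first one after the reset. That narrows down which cron jobs or hardware events to look at.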

1
