I have a CentOS 6.4 server on an oldish box (HP ProLiant ML350 G4). It was installed recently and has been up less than 24 hours. It has six 146 GB 10k SCSI drives in RAID 1+0 that are also brand new, with no signs of drive failure or any kind of hardware notice. Yet, randomly, this will happen:
Once this occurs, I can't log in (this occurs at the login prompt) and SSH is not responding. Ping is responding, but otherwise the box is locked up tight. Note that a reboot solves the issue for a brief time, but this has occurred at least 3 different times on CentOS 6.4 and Debian 6, both clean installs.
Anyone have any insight?
Edit: Logs after the fact show nothing (not even the messages mentioned).
Firmware on HP gear is always important. The Smart Array 6400 and 641/642 controllers of that era (2003-2005) used to freeze and do all sorts of funky things in certain situations. Update the firmware to the most recent version.
On the Linux side, the CCISS block device driver has been in the kernel for ages. It's typically stable. But there are some other considerations on EL6. Adjust your I/O scheduler or use the tuned-adm utility. Make sure you have a battery-backed cache on that controller if you're doing any write-heavy activity. And run a health status check on the controller with the hpacucli ctrl all show config detail command. Just make sure there aren't any disks in a funky or prefailure state.
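As a rough sketch of those checks, something like the following could be run as root. It only reads state and changes nothing. The helper name active_scheduler is made up for illustration, and the script assumes that tuned and hpacucli may or may not be installed, so it skips whichever is missing.

```shell
#!/bin/sh
# Sketch only: read-only checks, safe to run on a box without the HP tools.

# Hypothetical helper: pull the active scheduler (the bracketed entry)
# out of a sysfs scheduler line such as "noop anticipatory deadline [cfq]".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# 1. Show the active I/O scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler "$(cat "$f")")"
done

# 2. Show the active tuned profile, if tuned is installed.
if command -v tuned-adm >/dev/null 2>&1; then
    tuned-adm active
fi

# 3. Dump controller, cache and disk status, if hpacucli is installed.
if command -v hpacucli >/dev/null 2>&1; then
    hpacucli ctrl all show config detail
fi
```

To actually change things, you would write a scheduler name such as deadline into the device's queue/scheduler file under /sys/block (on a CCISS controller the device directory is typically named something like cciss!c0d0, but check your own system), or pick a profile with tuned-adm profile after looking at tuned-adm list.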