I have an issue with a number of DL180 servers, each with a P410 Smart Array controller and two logical drives: one for the root filesystem, and a large-ish 10TB filesystem that is exported over NFS.
The boxes are primarily NFS servers; they are frequently maxed out and are the bottleneck in the processing chain.
Every so often one of these 10TB logical drives fails and needs to be rebuilt. This happens about once a month, and it is a pain.
The message is "Message: This logical drive has failed and cannot be used. All data on this logical drive has been lost."
We have tried updating the firmware on the disk array and the kernel module, and have used various flavours of Linux for the host OS (Debian, CentOS), with both XFS and ext3 as the filesystem type. However, the logical drives still regularly need rebuilding from backups.
I have attached hpacucli diagnostic output for one of the failed drives: http://pastebin.com/9zTiuSAN
Some interesting output items:
Smart Array P410 in slot 1 : Identify Controller RAM Firmware Revision 2.00 ROM Firmware Revision 2.00
Any suggestions on what might be the problem, or how I might go about instrumenting these arrays/disks to get an idea of what is causing the drive to fail?
# cat output.txt | grep -B 2 'Drive Firmware Rev'
Drive Model ATA GB1000EAMYC
Drive Serial Number WMATV2509266
Drive Firmware Revision HPG2
--
Drive Model ATA GB1000EAMYC
Drive Serial Number WMATV1739564
Drive Firmware Revision HPG2
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ456MN
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ45RS3
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ460P0
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ454YN
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ4664M
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ457M9
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ46Q9E
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ4630X
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ454PD
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ45Z0Y
Drive Firmware Revision HPG8
--
Drive Model HP DF0146B8052
Drive Serial Number 3QN1KS7H00009949SQ4M
Drive Firmware Revision HPD5
--
Drive Model HP DF0146B8052
Drive Serial Number 3QN1KNFS00009949UX4F
Drive Firmware Revision HPD5
We had a similar issue with drives failing, and an HP KB article indicated that the drive firmware was the cause. Updating the firmware is supposed to address this issue. I was unable to open your post to see if it listed drive firmware versions.
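Since the firmware revision differs between drive models in the listing above (HPG2 on the GB1000EAMYC drives, HPG8 on the GB1000EAFJL ones), it may help to tally drives per model and firmware revision across all the boxes before comparing against the KB article. A rough Python sketch follows. It assumes the diagnostic dump uses the same three labels that the grep output shows; `firmware_by_model` and `SAMPLE` are names made up for illustration and are not part of any HP tool.

```python
from collections import defaultdict


def firmware_by_model(text):
    """Group drive serial numbers by (model, firmware revision).

    Assumes each drive appears as three labelled lines, in the order
    shown by the grep output: model, serial number, firmware revision.
    """
    groups = defaultdict(list)
    model = serial = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Drive Model"):
            model = line[len("Drive Model"):].strip()
        elif line.startswith("Drive Serial Number"):
            serial = line[len("Drive Serial Number"):].strip()
        elif line.startswith("Drive Firmware Revision"):
            firmware = line[len("Drive Firmware Revision"):].strip()
            # Fall back to "unknown" if a label was missing for this drive.
            groups[(model or "unknown", firmware)].append(serial or "unknown")
            model = serial = None
    return dict(groups)


# A few entries copied from the listing in the question.
SAMPLE = """\
Drive Model ATA GB1000EAMYC
Drive Serial Number WMATV2509266
Drive Firmware Revision HPG2
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ456MN
Drive Firmware Revision HPG8
--
Drive Model ATA GB1000EAFJL
Drive Serial Number 9QJ45RS3
Drive Firmware Revision HPG8
"""

if __name__ == "__main__":
    for (model, firmware), serials in sorted(firmware_by_model(SAMPLE).items()):
        print(f"{model}  {firmware}  {len(serials)} drive(s)")
```

Feeding it the full `output.txt` from each server instead of `SAMPLE` would show at a glance whether the arrays that fail share a particular model and firmware combination.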
Are the disks from HP, or some other manufacturer?
It's possible that HP disks have specially customised firmware, and if your disks aren't HP ones running this customised firmware, the RAID controller might be dropping them from the RAID array for various reasons.
If this is the case (non-HP disks), I'm not sure you'll find a definitive answer (or, unfortunately, a solution), since you can't reliably predict how the disks will behave with this RAID controller, and HP will have nothing to do with it.