This morning I came into the office to discover that two of the drives on a RAID-6 array (3ware 9650SE controller) were marked as DEGRADED and the array was rebuilding. After getting to about 4%, the controller reported ECC errors on a third drive (this may have happened when I tried to access the filesystem on this RAID and got I/O errors back from the controller). Now I'm in this state:
> /c2/u1 show

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u1       RAID-6    REBUILDING     4%(A)   -       -     64K     7450.5
u1-0     DISK      OK             -       -       p5    -       931.312
u1-1     DISK      OK             -       -       p2    -       931.312
u1-2     DISK      OK             -       -       p1    -       931.312
u1-3     DISK      OK             -       -       p4    -       931.312
u1-4     DISK      OK             -       -       p11   -       931.312
u1-5     DISK      DEGRADED       -       -       p6    -       931.312
u1-6     DISK      OK             -       -       p7    -       931.312
u1-7     DISK      DEGRADED       -       -       p3    -       931.312
u1-8     DISK      WARNING        -       -       p9    -       931.312
u1-9     DISK      OK             -       -       p10   -       931.312
u1/v0    Volume    -              -       -       -     -       7450.5
Examining the SMART data on the three drives in question, I found that the two marked DEGRADED are in good shape (SMART overall health PASSED, with no Current_Pending_Sector or Offline_Uncorrectable errors), but the drive listed as WARNING has 24 uncorrectable sectors.
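(For anyone checking SMART behind one of these controllers: smartmontools can address the individual ports, along the lines of the sketch below. The device node and port number are examples from my setup, not something universal.)

# Query the drive on port 9 behind the first 3ware controller.
# /dev/twa0 and the ",9" port are assumptions; adjust for your system.
smartctl -a -d 3ware,9 /dev/twa0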
And, the "rebuild" has been stuck at 4% for ten hours now.
So:
How do I get it to start actually rebuilding? This particular controller doesn't appear to support /c2/u1 resume rebuild, and the only rebuild command that seems to be available is one that wants to know which disk to add (/c2/u1 start rebuild disk=<p:-p...> [ignoreECC], according to the help). I have two hot spares in the server, and I'm happy to use them, but I don't understand what the controller would do with that information in its current state.
Can I pull the drive that is demonstrably failing (the WARNING drive) when I already have two DEGRADED drives in a RAID-6? It seems to me that the best scenario would be to pull the WARNING drive and tell the controller to use one of my hot spares for the rebuild. But won't I kill the array by pulling a "good" drive from a RAID-6 that already has two DEGRADED drives?
Finally, I've seen references in other posts to a bad bug in this controller that causes good drives to be marked as bad, and suggestions that upgrading the firmware may help. Is flashing the firmware a risky operation in this situation? Is it likely to help or hurt the rebuilding-but-stuck-at-4% array? Am I seeing this bug in action?
Advice outside the spiritual would be much appreciated. Thanks.
I managed to get the RAID to rebuild by issuing a single command in tw_cli, without pulling any drives or rebooting the system. The rebuild didn't proceed immediately, but at 2 AM the morning after I made the change, the rebuild started, and about six hours later it was complete. The drive with ECC errors had 24 bad sectors that have now been overwritten and reallocated by the drive (according to the SMART data). The filesystem seems intact, but I won't be surprised if I hit errors when I get to whatever data was on those sectors.
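For anyone in the same spot, the relevant knob in tw_cli is the per-unit ignoreECC policy, which lets a rebuild carry on past unreadable sectors instead of stalling. A sketch, assuming the unit is u1 on controller 2 as above (check your own controller/unit numbers):

# Enable the per-unit "ignore ECC" policy for unit u1 on controller 2.
tw_cli /c2/u1 set ignoreECC=on

# Confirm the policy is now set.
tw_cli /c2/u1 show ignoreECC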
In any case, I'm much better off than I was before, and will likely be able to recover the majority of the data. Once I've gotten what I can, I'll pop out the drive that's failing and have it rebuild onto a hot spare.
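Roughly, that last step would look like the following (port numbers reflect my layout and are assumptions; if a hot spare is already configured, the controller should start the rebuild on its own once the failing drive drops out):

# Remove (export) the failing drive, which in my case sits on port 9.
tw_cli /c2/p9 remove

# If the rebuild doesn't kick off automatically onto a configured hot spare,
# start it by hand against the spare's port (p8 here is a placeholder).
tw_cli /c2/u1 start rebuild disk=p8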