Intel Matrix Storage Console 8.9 showed a degraded array with one disk failure. Yet it offers the option to mark the disk as OK and rebuild the array. When would it be appropriate to do this? Does it assess disk failure incorrectly? Why offer this option at all?
This is a test server and I have backups, so I am not terribly concerned. I tried marking the disk as OK, and it rebuilt the volume without indicating any further problem.
BUT is there a problem anyway?
Additionally...
The great responses make me wonder what the best methods to test the disk might be. SMART tests are mentioned below. I will probably remove the drive and rebuild with a new one.
It still seems unclear to me whether a volume can rebuild onto a suspect drive without showing any errors, which appears to be what has already happened with this existing drive.
Drives can be marked as failed in an array for many reasons. Maybe there are a few defective sectors. Maybe the drive heads are failing. Maybe cosmic rays hit your drive at just the right angle and time to fail a scan. Maybe the firmware has a bug that only shows up under certain conditions.
Some of these are repairable failures, some aren't.
The thing is, it's really hard to predict hard drive failures. Google's infamous paper found that SMART was only useful in that drives that raised SMART alerts were more likely to fail than drives that didn't. Fully 36% of the failed drives had shown no SMART errors at all, fatal or otherwise. So you could run a full suite of SMART scans, find nothing, and know no more than you do now.
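If you do want to run that suite anyway, here is a minimal sketch using smartmontools from a Linux live/recovery environment; /dev/sda is just a placeholder for whatever device name your drive actually gets:

    # Kick off the drive's built-in self-tests (short first, then the long surface scan)
    smartctl -t short /dev/sda
    smartctl -t long /dev/sda

    # Once the tests have had time to finish, review the results
    smartctl -H /dev/sda            # overall health verdict
    smartctl -A /dev/sda            # attribute table (reallocated/pending sectors, etc.)
    smartctl -l selftest /dev/sda   # log of the self-tests started above

Even a clean pass here only tells you the drive hasn't reported anything yet, which is exactly the paper's point.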
But, assuming this was an out-of-the-blue failure and not an I-did-something-funny-and-it-failed failure, you already have an indication of problems with the disk. Now it's a question of value.
I've never been in a situation where it was worth gambling on a drive that has already been flagged as failed. Why go through the pain? Chances are, the drive you need is pretty cheap. Just buy it and move on.
I once had a faulty caddy in an old U160 SCSI array; the disk in it was one of 14 in the array. When I replaced the caddy (the disk itself was fine), the controller still thought the disk had failed because it had the same serial number.
So I marked it as OK, the array rebuilt, and all was fine until we decommissioned it.
It all depends on your situation, but normally I would never mark a disk as OK unless I was 100% certain that it was OK. Even at 99.9% certain, I would delete the array and start again.
If you care about the data, replace the drive immediately with a new one and rebuild the array. You can then run extensive testing on the removed drive and requalify it for use if it passes. However, if you try to rebuild the failed drive in place, you are extending the time you are vulnerable to a double-drive failure should something go wrong during or after the rebuild process.
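If this were Linux software RAID rather than the Intel Matrix/RST firmware RAID from the question, the swap-and-rebuild would look roughly like this (a sketch only; /dev/md0, /dev/sdb1 and /dev/sdc1 are placeholder names):

    # Fail and remove the suspect member, then add the new disk's partition
    mdadm --manage /dev/md0 --fail /dev/sdb1
    mdadm --manage /dev/md0 --remove /dev/sdb1
    mdadm --manage /dev/md0 --add /dev/sdc1

    # Watch the rebuild progress; the array stays degraded until this finishes
    watch cat /proc/mdstat

The removed drive can then be tested at leisure on another machine while the array is back on known-good hardware.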
It entirely depends on the reason the drive was failed. In some cases I've seen perfectly fine disks get failed on startup with cheap RAID cards because the controller had a derp moment and didn't detect the drive. This is pretty rare, though. In my case I ran a bunch of SMART tests on the drive and did a full bad-blocks test pass by wiping the entire drive with dd. That particular drive was OK by all my standards, and since I was running RAID 5 and not linear or RAID 0, I added it back to the array.
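For reference, that kind of destructive test would look something like this on a spare Linux box, assuming the drive is /dev/sdX and contains nothing you want to keep (both commands overwrite every sector):

    # Four-pattern destructive write/read test of the whole drive
    badblocks -wsv /dev/sdX

    # Or simply zero the whole drive and let the firmware remap anything weak
    dd if=/dev/zero of=/dev/sdX bs=1M status=progress

    # If it survives, add it back into the (redundant) md array
    mdadm --manage /dev/md0 --add /dev/sdX1

Only do this for a RAID level with redundancy, for the reason given above.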
Run a SMART test using a Linux recovery disk or similar and make a note of the bad/reallocated block count, then run a full SMART test and look at the count again. If it spiked by anything more than 20 I wouldn't trust it. The same goes if the bad block count is already particularly high for that drive size/make.
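A sketch of that before/after comparison with smartctl (the device name is a placeholder; the attributes to watch are the reallocated and pending sector counts):

    # Snapshot the sector-health attributes before the long test
    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable' > before.txt

    # Run the full (long) self-test, wait for it to finish, then snapshot again
    smartctl -t long /dev/sdX
    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable' > after.txt

    diff before.txt after.txt

Any significant jump between the two snapshots means the drive is still finding and remapping bad sectors, which is when I'd stop trusting it.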
The risk is not just that drives fail completely, but that your data may become corrupted over time.
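That is why, on Linux md arrays at least, it is worth scrubbing periodically so silent corruption is found while you still have redundancy (a sketch; /dev/md0 is a placeholder):

    # Ask md to read and verify every stripe in the background
    echo check > /sys/block/md0/md/sync_action

    # Progress shows up in /proc/mdstat; mismatches are counted here
    cat /sys/block/md0/md/mismatch_cnt

Many distributions already ship a monthly cron job or timer that does this.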
Can you also include the output of "smartctl -a /dev/hda" for this drive in the original question? Thanks.
Yes, this is old, but...
Another reason to "OK" a drive is that, until you have the hardware on hand to replace the bad drive, it costs basically nothing to rebuild the array onto it, and if another drive fails before you can do the replacement you still have a chance of living on.
Specifically, on drive failure you want to mark the "failed" drive OK so the array rebuilds onto it, and order replacement drives in the meantime. Once they arrive you can refail and replace the bad drive (by manually failing it after a resync you have the best odds of ripping useful data off it if worst comes to worst), and rinse and repeat for each drive as the new drives come in; see the sketch at the end of this answer.
Note: if it's the primary drive and you can't hot-swap, be prepared to point your BIOS at the secondary drive to boot.
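With Linux md as a stand-in (the firmware RAID from the question has its own equivalents in the console), the rebuild-then-refail cycle looks roughly like this; device names are placeholders:

    # Put the "failed" member back in service so the array regains redundancy
    # (--re-add works when the old metadata is intact; otherwise use --add)
    mdadm --manage /dev/md0 --re-add /dev/sdb1
    cat /proc/mdstat                 # wait for the resync to complete

    # Once the replacement drive arrives, fail the bad member on your own terms and swap it
    mdadm --manage /dev/md0 --fail /dev/sdb1
    mdadm --manage /dev/md0 --remove /dev/sdb1
    mdadm --manage /dev/md0 --add /dev/sdc1

Repeat the fail/remove/add step for each member as its replacement comes in.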