Intel Matrix Storage Console 8.9 showed a degraded array with one disk failure. Yet it offers the option to mark the disk as OK and rebuild the array. When would it be appropriate to do this? Does it assess disk failure incorrectly? Why offer this option at all?
This is a test server and I have backups, so I am not terribly concerned. I tried marking the disk as OK, and it rebuilt the volume without indicating any further problem.
BUT is there a problem anyway?
Additionally...
The great responses make me wonder what the best methods to test the disk might be. SMART tests are mentioned below. I will probably remove the drive and rebuild with a new one.
It still seems unclear to me whether a volume can rebuild onto a suspect drive without showing any errors, which appears to be what has already happened with this existing drive.
Drives can be marked as failed in an array for many reasons. Maybe there are a few defective sectors. Maybe the drive heads are failing. Maybe cosmic rays hit your drive at just the right angle and time to fail a scan. Maybe the firmware has a bug that only shows up under certain conditions.
Some of these are repairable failures, some aren't.
The thing is, it's really hard to predict hard drive failures. Google's infamous paper found that SMART was only useful in that drives that raised SMART alerts were more likely to fail than drives that didn't. Fully 36% of the failed drives had shown no SMART errors at all, fatal or otherwise. So you could run a full suite of SMART scans, find nothing, and know no more than you do now.
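If you do want to run that suite anyway, here is a minimal sketch using smartmontools from a Linux live/recovery environment; /dev/sda is just a placeholder for whatever device name your drive actually gets:

    # Kick off the drive's built-in self-tests (short first, then the long surface scan)
    smartctl -t short /dev/sda
    smartctl -t long /dev/sda

    # Once the tests have had time to finish, review the results
    smartctl -H /dev/sda            # overall health verdict
    smartctl -A /dev/sda            # attribute table (reallocated/pending sectors, etc.)
    smartctl -l selftest /dev/sda   # log of the self-tests started above

Even a clean pass here only tells you the drive hasn't reported anything yet, which is exactly the paper's point.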
But, assuming this was an out-of-the-blue failure and not an I-did-something-funny-and-it-failed failure, you already have an indication of problems with the disk. Now it's a question of value.
I've never been in a situation where it was worth gambling on a drive that has already been flagged as failed. Why go through the pain? Chances are, the drive you need is pretty cheap. Just buy it and move on.
I once had a faulty caddy in an old U160 SCSI array; the disk in it was one of 14 in the array. When I replaced the caddy (the disk itself was fine), the controller still thought the disk had failed because it had the same serial number.
So I marked it as OK, the array rebuilt, and all was fine until we decommissioned it.
It all depends on your situation, but normally I would never mark a disk as OK unless I was 100% certain that it was OK. Even at 99.9% certain, I would delete the array and start again.
If you care about the data, replace the drive immediately with a new one and rebuild the array. You can then run extensive testing on the removed drive and requalify it for use if it passes. However, if you try to rebuild the failed drive in place, you are extending the time you are vulnerable to a double-drive failure should something go wrong during or after the rebuild process.
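If this were Linux software RAID rather than the Intel Matrix/RST firmware RAID from the question, the swap-and-rebuild would look roughly like this (a sketch only; /dev/md0, /dev/sdb1 and /dev/sdc1 are placeholder names):

    # Fail and remove the suspect member, then add the new disk's partition
    mdadm --manage /dev/md0 --fail /dev/sdb1
    mdadm --manage /dev/md0 --remove /dev/sdb1
    mdadm --manage /dev/md0 --add /dev/sdc1

    # Watch the rebuild progress; the array stays degraded until this finishes
    watch cat /proc/mdstat

The removed drive can then be tested at leisure on another machine while the array is back on known-good hardware.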
It entirely depends on the reason the drive was failed. In some cases I've seen perfectly fine disks get failed on startup with cheap RAID cards because the controller had a derp moment and didn't detect the drive. This is pretty rare, though. In my case I ran a bunch of SMART tests on the drive and did a full bad-blocks test pass by wiping the entire drive with dd. That particular drive was OK by all my standards, and since I was running RAID 5 and not linear or RAID 0, I added it back to the array.
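For reference, that kind of destructive test would look something like this on a spare Linux box, assuming the drive is /dev/sdX and contains nothing you want to keep (both commands overwrite every sector):

    # Four-pattern destructive write/read test of the whole drive
    badblocks -wsv /dev/sdX

    # Or simply zero the whole drive and let the firmware remap anything weak
    dd if=/dev/zero of=/dev/sdX bs=1M status=progress

    # If it survives, add it back into the (redundant) md array
    mdadm --manage /dev/md0 --add /dev/sdX1

Only do this for a RAID level with redundancy, for the reason given above.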
Run a SMART test using a Linux recovery disk or similar and make a note of the bad/reallocated block count, then run a full SMART test and look at the count again. If it spiked by anything more than 20 I wouldn't trust it. The same goes if the bad block count is already particularly high for that drive size/make.
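A sketch of that before/after comparison with smartctl (the device name is a placeholder; the attributes to watch are the reallocated and pending sector counts):

    # Snapshot the sector-health attributes before the long test
    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable' > before.txt

    # Run the full (long) self-test, wait for it to finish, then snapshot again
    smartctl -t long /dev/sdX
    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable' > after.txt

    diff before.txt after.txt

Any significant jump between the two snapshots means the drive is still finding and remapping bad sectors, which is when I'd stop trusting it.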
The risk is not just that drives fail completely, but that your data may become corrupted over time.
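That is why, on Linux md arrays at least, it is worth scrubbing periodically so silent corruption is found while you still have redundancy (a sketch; /dev/md0 is a placeholder):

    # Ask md to read and verify every stripe in the background
    echo check > /sys/block/md0/md/sync_action

    # Progress shows up in /proc/mdstat; mismatches are counted here
    cat /sys/block/md0/md/mismatch_cnt

Many distributions already ship a monthly cron job or timer that does this.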
Can you also include the output of "smartctl -a /dev/hda" for this drive in the original question? Thanks.
Yes, this is old, but...
Another reason to "OK" a drive is that, until you have the hardware on hand to replace the bad drive, it costs basically nothing to rebuild the array onto it, and if another drive fails before you can do the replacement you still have a chance of living on.
Specifically, on drive failure you want to mark the "failed" drive OK so the array rebuilds onto it, and order replacement drives in the meantime. Once they arrive you can refail and replace the bad drive (by manually failing it after a resync you have the best odds of ripping useful data off it if worst comes to worst), and rinse and repeat for each drive as the new drives come in; see the sketch at the end of this answer.
Note: if it's the primary drive and you can't hot-swap, be prepared to point your BIOS at the secondary drive to boot.
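With Linux md as a stand-in (the firmware RAID from the question has its own equivalents in the console), the rebuild-then-refail cycle looks roughly like this; device names are placeholders:

    # Put the "failed" member back in service so the array regains redundancy
    # (--re-add works when the old metadata is intact; otherwise use --add)
    mdadm --manage /dev/md0 --re-add /dev/sdb1
    cat /proc/mdstat                 # wait for the resync to complete

    # Once the replacement drive arrives, fail the bad member on your own terms and swap it
    mdadm --manage /dev/md0 --fail /dev/sdb1
    mdadm --manage /dev/md0 --remove /dev/sdb1
    mdadm --manage /dev/md0 --add /dev/sdc1

Repeat the fail/remove/add step for each member as its replacement comes in.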