Question

Bill Weiss

Asked: 2011-11-21 07:22:43 +0800 CST

What does 3Ware's tw_cli mean by a "DEGRADED" disk vs "ECC-ERROR"?

I've got a sad RAID array on a 3ware 9650SE-16ML card. What I can't tell is whether I've just suffered a double-disk failure (bummer!) or whether I'm reading this wrong. The relevant output of /c0 show all is:

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     DEGRADED         u0     931.51 GB   1953525168    5QJ07MAH            
p1     ECC-ERROR        u0     931.51 GB   1953525168    5QJ0DCW9            
p2     OK               u0     931.51 GB   1953525168    5QJ0DW9C            
p3     OK               u0     931.51 GB   1953525168    5QJ0CKXJ

And the failure is (from show alarms):

Ctl  Date                        Severity  Alarm Message
------------------------------------------------------------------------------
c0   [Sun Nov 20 07:47:23 2011]  INFO      Rebuild started: unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Drive ECC error reported: port=1, unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Source drive error occurred: port=1, unit=0
c0   [Sun Nov 20 08:20:12 2011]  ERROR     Rebuild failed: unit=0
c0   [Sun Nov 20 08:20:12 2011]  INFO      Rebuild paused: unit=0

I think that what happened is p0 failed, and then p1 had an ECC error (aka, my data is gone). But... maybe not? It stays at 97% rebuilt, but can't get past this error.

As far as I can tell, a previous admin turned off the periodic verify, which is what got us into this state. This isn't something most people should worry about with their 3Ware RAIDs!

Update

After beating on it for a couple of days, I did the IgnoreECC bit and it rebuilt, but my data is hosed. Bummer.

3 Answers

Sergey Vlasov · Answer 1 · 2011-11-21T07:42:56+08:00

ECC error means that there is at least one unreadable sector on the drive. However, if you are lucky, that sector might not actually be used by the filesystem located on that volume, therefore you might still be able to copy your data from the array in this state.

There are also some options to ignore ECC errors during rebuild:

/cx/ux start rebuild disk=p [ignoreECC]
/cx/ux set ignoreECC=on|off

However, using these options means that the RAID stripe affected by a bad sector will be corrupted (not sure what exactly the card will do in this case — it might replace the whole stripe with zeros, or even with random data), therefore the “recovered” array might actually have undetectable corruption (if the affected stripe was in the middle of some data file). Copying your data from the array to some other place before trying to rebuild might be safer (at least you should get errors when trying to read the bad area).
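As a rough illustration of the "copy first" approach, here is a minimal Python sketch that reads a source in fixed-size chunks, zero-fills any chunk that raises a read error, and records where that happened, so you at least know which regions of the copy are suspect. The function name, the chunk size, and the idea of passing the total size in by hand are all assumptions made for this example; in practice a dedicated tool such as GNU ddrescue is likely the better choice.

```python
def copy_skipping_bad_chunks(src, dst, total_size, chunk_size=1024 * 1024):
    """Copy total_size bytes from src to dst in chunks.

    src must support seek() and read(); dst must support write().
    Returns a list of (offset, length) ranges that could not be read
    and were written as zeros instead.
    """
    bad_ranges = []
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            # Treat the whole chunk as unreadable.
            data = b""
        if len(data) < length:
            # Record the part we could not read and pad it with zeros
            # so that later offsets in the copy stay aligned.
            missing = length - len(data)
            bad_ranges.append((offset + len(data), missing))
            data += b"\x00" * missing
        dst.write(data)
        offset += length
    return bad_ranges
```

With a real array, src would be the unit's block device opened in binary mode and total_size its size in bytes. A smaller chunk_size narrows down the bad area at the cost of a slower copy.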

You should set up a scheduled verify of the array to catch unreadable sectors earlier, so that you can replace a drive as soon as it starts going bad.
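Alongside a scheduled verify, it can help to have monitoring look at the port table regularly. The following is a minimal Python sketch that parses a port table in the format shown in the question and reports every port whose status is not OK. The function names are made up for this example, the text is assumed to have been captured already from /c0 show all, and treating every non-OK status as a problem is a simplification.

```python
def parse_port_status(output):
    """Return {port: status} for rows such as 'p0  DEGRADED  u0 ...'."""
    ports = {}
    for line in output.splitlines():
        fields = line.split()
        # Port rows start with a lowercase 'p' followed by digits;
        # the header row ("Port") and the dashed rule are skipped.
        if len(fields) >= 2 and fields[0].startswith("p") and fields[0][1:].isdigit():
            ports[fields[0]] = fields[1]
    return ports


def unhealthy_ports(output):
    """Return only the ports whose status is something other than OK."""
    return {port: status
            for port, status in parse_port_status(output).items()
            if status != "OK"}


# Port table copied from the question.
SAMPLE = """\
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     DEGRADED         u0     931.51 GB   1953525168    5QJ07MAH
p1     ECC-ERROR        u0     931.51 GB   1953525168    5QJ0DCW9
p2     OK               u0     931.51 GB   1953525168    5QJ0DW9C
p3     OK               u0     931.51 GB   1953525168    5QJ0CKXJ
"""

print(unhealthy_ports(SAMPLE))
```

For the table in the question this reports p0 and p1. A cron job could run something like it and send mail whenever the result is non-empty.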

ZaphodB · Answer 2 · 2011-11-21T07:34:21+08:00

I have never seen a physical drive (p0) go into DEGRADED status. However, you might be able to bring back the ECC-ERROR drive, or even the DEGRADED drive, by removing it via

/c0 p1 remove

and then issuing a rescan

/c0 rescan

and then putting it back into the RAID unit via

maint rebuild c0 u0 p1

I have been able to resurrect SATA drives that failed on me with ECC-ERROR, if only for a few hours before they failed again.

Sven · Answer 3 · 2011-11-21T07:33:52+08:00

It's very likely your data is gone. ECC error means an unrecoverable error while reading from this disk.

If you don't have a backup, you can try to dump the current state of the array. This might work because the controller cannot tell whether the unreadable sector held data or was just an empty area (it has no insight into the file system).
