Here is my situation.
I have a Dell server with a Dell PERC 7i controller (an LSI-based controller).
A drive gave me a Failure Predicted warning, so I called Dell support; they came out and replaced the drive and the array rebuilt itself. Pretty standard.
Two weeks later, another drive gave me the Failure Predicted warning. I figured maybe it was a bad batch of drives, a coincidence, etc., so I contacted support and looked more in-depth. It turns out there were bad blocks on one of the other drives that hadn't failed, and those bad blocks were copied over during the rebuild. So now I have bad blocks all over the place and they are slowly killing my array. I have come to find that this is called a punctured array.
So their advice was to replace all the drives, rebuild the array, and restore from backup. Except I've been having this issue for a few weeks, which means my backups are likely bad too, and if I restore from a backup taken before the problem started (about a month ago), I'll lose roughly four weeks' worth of data from my database, which is totally unacceptable for our office.
My question is: has anyone ever recovered from something like this without losing data and without the whole "throw it all out the window and start over" approach?
I did find one link that covered my scenario, not sure if it sheds any light on the situation : http://www.theprojectbot.com/raid/what-is-a-punctured-raid-array/
Any help or direction would be appreciated! What do you guys think?
I assume your system is still up, so the best thing to do is make an immediate backup, scrap the disks/array, rebuild, and restore from that backup.
Bad blocks don't always mean your backups are also bad. If you haven't experienced any performance problems or damaged files, then your backups should still be complete enough to finish a restore.
To test, take your most recent backup and examine your most important data. If it's still intact, you likely have a good backup.
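If you want something more systematic than eyeballing a few files, here's a rough Python sketch that walks a restored test copy of the backup and tries to read every file end to end. The RESTORE_DIR path is a placeholder for wherever you restored the test copy. A clean read of everything isn't proof the data is good, but an I/O error or a hang partway through is a strong hint the backup caught the bad blocks:

    # backup_spotcheck.py -- a minimal sketch; RESTORE_DIR is an assumption,
    # point it at wherever you restored the backup for testing.
    import hashlib
    import os

    RESTORE_DIR = "/mnt/backup_test"

    def spot_check(root):
        """Read every file fully; any I/O error here points at a damaged backup."""
        bad = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                try:
                    with open(path, "rb") as fh:
                        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                            digest.update(chunk)
                except OSError as exc:
                    bad.append((path, str(exc)))
                    continue
                print(f"OK  {digest.hexdigest()[:12]}  {path}")
        return bad

    if __name__ == "__main__":
        problems = spot_check(RESTORE_DIR)
        if problems:
            print(f"\n{len(problems)} file(s) could not be read cleanly:")
            for path, err in problems:
                print(f"  {path}: {err}")
        else:
            print("\nAll files read cleanly -- a good sign, though not proof of integrity.")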
At this point, there is a risk involved as you cannot be 100% certain that your backups are good or that backing up now won't cause file loss. However, your array will eventually fail and force a restore anyway, so this is your only real option.
Right this instant, do the following: run a fresh full backup of the entire system to separate media, and don't overwrite your existing backups while doing it. Hopefully the disks are still good enough that your data is intact, and you won't encounter any problems running the new full backup.
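For that immediate backup, if this is a Linux box and you want a belt-and-braces file-level copy onto separate media in addition to whatever your normal backup software does, something along these lines would do. The directory names and the /mnt/external mount point are assumptions; substitute your real paths, and take a proper database dump or stop the database service first so the copy is consistent. Adding file by file means one unreadable file (a likely symptom of the bad blocks) doesn't abort the whole archive:

    # emergency_backup.py -- a minimal sketch, assuming a Linux host and that
    # /mnt/external is separate media (USB disk, NAS share, etc.).
    # DATA_DIRS is a placeholder list -- substitute your real data paths.
    import datetime
    import os
    import tarfile

    DATA_DIRS = ["/var/lib/mysql", "/srv/officefiles"]   # assumptions
    DEST_DIR = "/mnt/external"                           # assumption: separate media

    def emergency_backup(sources, dest_dir):
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        archive_path = os.path.join(dest_dir, f"emergency-{stamp}.tar.gz")
        unreadable = []
        with tarfile.open(archive_path, "w:gz") as tar:
            for src in sources:
                for dirpath, _dirs, files in os.walk(src):
                    for name in files:
                        path = os.path.join(dirpath, name)
                        try:
                            tar.add(path)           # one file at a time, so a read
                        except OSError as exc:      # error skips only that file
                            unreadable.append((path, str(exc)))
        return archive_path, unreadable

    if __name__ == "__main__":
        archive, failed = emergency_backup(DATA_DIRS, DEST_DIR)
        print(f"Wrote {archive}")
        for path, err in failed:
            print(f"COULD NOT READ: {path}: {err}")

Anything that shows up in the "could not read" list is exactly the data you'll want to chase down in older backups later.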
Then scrap those disks, and build a new RAID array. Once that's ready, try to restore from the backup you took just now. With any luck, that'll be all you need to do.
If that fails, try the next oldest, and the next oldest after that, and so on. Be sure to test the functionality of the system: just because it boots doesn't mean it's fully operational. In particular, test the databases for corruption.
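For the corruption test, use your DBMS's own checker. As a rough sketch, here's what that looks like scripted against MySQL/MariaDB with mysqlcheck (the database name is a placeholder, and this assumes credentials are in ~/.my.cnf); on SQL Server you'd run DBCC CHECKDB instead, and on PostgreSQL a full pg_dump redirected to /dev/null is a decent read-everything test:

    # db_integrity_check.py -- a minimal sketch, assuming MySQL/MariaDB and that
    # the mysqlcheck client is installed with credentials in ~/.my.cnf.
    import subprocess
    import sys

    DATABASE = "office_db"   # assumption: your database name

    def check_database(db_name):
        """Run mysqlcheck in extended check mode and report whether any table failed."""
        result = subprocess.run(
            ["mysqlcheck", "--check", "--extended", db_name],
            capture_output=True,
            text=True,
        )
        print(result.stdout)
        if result.returncode != 0 or "error" in result.stdout.lower():
            print("Integrity problems reported -- treat this restore as suspect.")
            return False
        print("No corruption reported by mysqlcheck.")
        return True

    if __name__ == "__main__":
        sys.exit(0 if check_database(DATABASE) else 1)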
If you had to restore the entire system from an older backup, that's OK. Take the newest backup and restore just the database files and other important files on top of it. Test them to make sure they work properly. Again, if that fails, try the next oldest.
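If that newest backup is a plain tar archive (like the emergency one sketched above), pulling only the database files out of it into a staging area might look something like this. The archive path and the path prefix are assumptions; verify the staged files with your DBMS's integrity tools before moving anything into place:

    # selective_restore.py -- a minimal sketch, assuming a tar.gz backup archive.
    import os
    import tarfile

    ARCHIVE = "/mnt/external/emergency-20140101-120000.tar.gz"  # hypothetical path
    DB_PREFIX = "var/lib/mysql"                                 # assumption
    STAGING_DIR = "/restore/staging"

    def restore_database_files(archive_path, prefix, dest):
        os.makedirs(dest, exist_ok=True)
        restored = []
        with tarfile.open(archive_path, "r:gz") as tar:
            for member in tar.getmembers():
                # Pull out only the database files; leave the rest of the system alone.
                if member.name.lstrip("./").startswith(prefix):
                    tar.extract(member, path=dest)
                    restored.append(member.name)
        return restored

    if __name__ == "__main__":
        files = restore_database_files(ARCHIVE, DB_PREFIX, STAGING_DIR)
        print(f"Restored {len(files)} database file(s) to {STAGING_DIR} -- "
              "check them before cutting over.")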
Using this process minimizes the data loss.
The answers provided by Grant and Nathan C are great with regard to how you should proceed with backups, restoring, and data integrity.
Here's some clearer detail on how to handle the RAID set when it comes time to recreate the virtual disk and restore from backup:
Note: If you've been using RAID5, you should SERIOUSLY consider RAID6 this time. By current industry best practices, RAID5 is not considered reliable enough for business-critical data on an array of this size. Large-capacity SATA/NL-SAS disks also carry a higher risk of hitting an unrecoverable read error (URE) during a rebuild, which can result in exactly the kind of puncture you're dealing with. RAID6 vastly reduces that risk and is generally acceptable for critical data at currently available drive capacities.
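To put a rough number on that URE risk, here's a back-of-the-envelope sketch. It assumes the commonly quoted 1-in-10^14-bits URE spec for SATA/NL-SAS drives and 2 TB disks in a 6-disk RAID5, so five surviving disks get read end to end during a rebuild; your drive sizes and specs may differ:

    # ure_risk.py -- back-of-the-envelope only; all three constants are assumptions.
    import math

    URE_RATE = 1e-14          # unrecoverable read errors per bit (typical vendor spec)
    DRIVE_TB = 2.0            # capacity of each drive in TB
    SURVIVING_DRIVES = 5      # drives that must be read fully during a RAID5 rebuild

    def rebuild_ure_probability(drive_tb, drives, ure_rate):
        """Probability of hitting at least one URE while reading `drives` full disks."""
        bits_read = drive_tb * 1e12 * 8 * drives
        return 1.0 - math.exp(-ure_rate * bits_read)

    if __name__ == "__main__":
        p = rebuild_ure_probability(DRIVE_TB, SURVIVING_DRIVES, URE_RATE)
        print(f"Chance of at least one URE during a RAID5 rebuild: {p:.0%}")
        # With RAID6, a single URE hit during a one-disk rebuild can still be
        # recovered from the second parity, which is why the practical risk drops.

With those assumptions it comes out around a coin flip per rebuild, which is exactly why the second parity stripe matters so much on arrays this size.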