Here is my situation.
I have a Dell server with a Dell PERC 7i controller (an LSI-based controller).
A drive gave me a Failure Predicted warning, so I called Dell support; they came out and replaced the drive and the array rebuilt itself. Pretty standard.
Two weeks later, another drive gave me the Failure Predicted warning. I figured maybe it was a bad batch of drives, a coincidence, etc., so I contacted support and looked more in-depth. It turns out there were bad blocks on one of the other drives that hadn't failed, and those bad blocks were copied over during the rebuild. So now I have bad blocks all over the place and they are slowly killing my array. I have come to find that this is called a punctured array.
So their advice was to replace all the drives, rebuild the array, and restore from backup. Except I've been having this issue for a few weeks, which means my backups are likely bad too, and if I restore from a backup taken before the problem started (about a month ago), I'll lose roughly four weeks' worth of data from my database, which is totally unacceptable for our office.
My question is: has anyone ever recovered from something like this without losing data and without the whole "throw it all out the window and start over" approach?
I did find one link that covered my scenario, not sure if it sheds any light on the situation : http://www.theprojectbot.com/raid/what-is-a-punctured-raid-array/
Any help or direction would be appreciated! What do you guys think?
I assume your system is still up, so the best thing to do is make an immediate backup, scrap the disks/array, rebuild, and restore from that backup.
Bad blocks don't always mean your backups are also bad. If you haven't experienced any performance problems or damaged files, then your backups should still be complete enough to finish a restore.
To test, take your most recent backup and examine your most important data. If it's still intact, you likely have a good backup.
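If you want something more systematic than eyeballing a few files, here's a rough Python sketch that walks a restored test copy of the backup and tries to read every file end to end. The RESTORE_DIR path is a placeholder for wherever you restored the test copy. A clean read of everything isn't proof the data is good, but an I/O error or a hang partway through is a strong hint the backup caught the bad blocks:

    # backup_spotcheck.py -- a minimal sketch; RESTORE_DIR is an assumption,
    # point it at wherever you restored the backup for testing.
    import hashlib
    import os

    RESTORE_DIR = "/mnt/backup_test"

    def spot_check(root):
        """Read every file fully; any I/O error here points at a damaged backup."""
        bad = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                try:
                    with open(path, "rb") as fh:
                        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                            digest.update(chunk)
                except OSError as exc:
                    bad.append((path, str(exc)))
                    continue
                print(f"OK  {digest.hexdigest()[:12]}  {path}")
        return bad

    if __name__ == "__main__":
        problems = spot_check(RESTORE_DIR)
        if problems:
            print(f"\n{len(problems)} file(s) could not be read cleanly:")
            for path, err in problems:
                print(f"  {path}: {err}")
        else:
            print("\nAll files read cleanly -- a good sign, though not proof of integrity.")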
At this point, there is a risk involved as you cannot be 100% certain that your backups are good or that backing up now won't cause file loss. However, your array will eventually fail and force a restore anyway, so this is your only real option.
Right this instant, do the following: run a fresh full backup of the entire system to separate media, and don't overwrite your existing backups while doing it. Hopefully the disks are still good enough that your data is intact, and you won't encounter any problems running the new full backup.
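For that immediate backup, if this is a Linux box and you want a belt-and-braces file-level copy onto separate media in addition to whatever your normal backup software does, something along these lines would do. The directory names and the /mnt/external mount point are assumptions; substitute your real paths, and take a proper database dump or stop the database service first so the copy is consistent. Adding file by file means one unreadable file (a likely symptom of the bad blocks) doesn't abort the whole archive:

    # emergency_backup.py -- a minimal sketch, assuming a Linux host and that
    # /mnt/external is separate media (USB disk, NAS share, etc.).
    # DATA_DIRS is a placeholder list -- substitute your real data paths.
    import datetime
    import os
    import tarfile

    DATA_DIRS = ["/var/lib/mysql", "/srv/officefiles"]   # assumptions
    DEST_DIR = "/mnt/external"                           # assumption: separate media

    def emergency_backup(sources, dest_dir):
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        archive_path = os.path.join(dest_dir, f"emergency-{stamp}.tar.gz")
        unreadable = []
        with tarfile.open(archive_path, "w:gz") as tar:
            for src in sources:
                for dirpath, _dirs, files in os.walk(src):
                    for name in files:
                        path = os.path.join(dirpath, name)
                        try:
                            tar.add(path)           # one file at a time, so a read
                        except OSError as exc:      # error skips only that file
                            unreadable.append((path, str(exc)))
        return archive_path, unreadable

    if __name__ == "__main__":
        archive, failed = emergency_backup(DATA_DIRS, DEST_DIR)
        print(f"Wrote {archive}")
        for path, err in failed:
            print(f"COULD NOT READ: {path}: {err}")

Anything that shows up in the "could not read" list is exactly the data you'll want to chase down in older backups later.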
Then scrap those disks, and build a new RAID array. Once that's ready, try to restore from the backup you took just now. With any luck, that'll be all you need to do.
If that fails, try the next oldest, and the next oldest after that, and so on. Be sure to test the functionality of the system: just because it boots doesn't mean it's fully operational. In particular, test the databases for corruption.
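For the corruption test, use your DBMS's own checker. As a rough sketch, here's what that looks like scripted against MySQL/MariaDB with mysqlcheck (the database name is a placeholder, and this assumes credentials are in ~/.my.cnf); on SQL Server you'd run DBCC CHECKDB instead, and on PostgreSQL a full pg_dump redirected to /dev/null is a decent read-everything test:

    # db_integrity_check.py -- a minimal sketch, assuming MySQL/MariaDB and that
    # the mysqlcheck client is installed with credentials in ~/.my.cnf.
    import subprocess
    import sys

    DATABASE = "office_db"   # assumption: your database name

    def check_database(db_name):
        """Run mysqlcheck in extended check mode and report whether any table failed."""
        result = subprocess.run(
            ["mysqlcheck", "--check", "--extended", db_name],
            capture_output=True,
            text=True,
        )
        print(result.stdout)
        if result.returncode != 0 or "error" in result.stdout.lower():
            print("Integrity problems reported -- treat this restore as suspect.")
            return False
        print("No corruption reported by mysqlcheck.")
        return True

    if __name__ == "__main__":
        sys.exit(0 if check_database(DATABASE) else 1)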
If you had to restore the entire system from an older backup, that's OK. Take the newest backup and restore just the database files and other important files on top of it. Test them to make sure they work properly. Again, if that fails, try the next oldest.
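If that newest backup is a plain tar archive (like the emergency one sketched above), pulling only the database files out of it into a staging area might look something like this. The archive path and the path prefix are assumptions; verify the staged files with your DBMS's integrity tools before moving anything into place:

    # selective_restore.py -- a minimal sketch, assuming a tar.gz backup archive.
    import os
    import tarfile

    ARCHIVE = "/mnt/external/emergency-20140101-120000.tar.gz"  # hypothetical path
    DB_PREFIX = "var/lib/mysql"                                 # assumption
    STAGING_DIR = "/restore/staging"

    def restore_database_files(archive_path, prefix, dest):
        os.makedirs(dest, exist_ok=True)
        restored = []
        with tarfile.open(archive_path, "r:gz") as tar:
            for member in tar.getmembers():
                # Pull out only the database files; leave the rest of the system alone.
                if member.name.lstrip("./").startswith(prefix):
                    tar.extract(member, path=dest)
                    restored.append(member.name)
        return restored

    if __name__ == "__main__":
        files = restore_database_files(ARCHIVE, DB_PREFIX, STAGING_DIR)
        print(f"Restored {len(files)} database file(s) to {STAGING_DIR} -- "
              "check them before cutting over.")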
Using this process minimizes the data loss.
The answers provided by Grant and Nathan C are great with regard to how you should proceed with backups, restoring, and data integrity.
Here's some clearer detail on how to handle the RAID set when it comes time to recreate the virtual disk and restore from backup:
Note: If you've been using RAID5, you should SERIOUSLY consider RAID6 this time. By current industry best practices, RAID5 is not considered reliable enough for business-critical data on an array of this size. Large-capacity SATA/NL-SAS disks also carry a higher risk of hitting an unrecoverable read error (URE) during a rebuild, which can result in exactly the kind of puncture you're dealing with. RAID6 vastly reduces that risk and is generally acceptable for critical data at currently available drive capacities.
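To put a rough number on that URE risk, here's a back-of-the-envelope sketch. It assumes the commonly quoted 1-in-10^14-bits URE spec for SATA/NL-SAS drives and 2 TB disks in a 6-disk RAID5, so five surviving disks get read end to end during a rebuild; your drive sizes and specs may differ:

    # ure_risk.py -- back-of-the-envelope only; all three constants are assumptions.
    import math

    URE_RATE = 1e-14          # unrecoverable read errors per bit (typical vendor spec)
    DRIVE_TB = 2.0            # capacity of each drive in TB
    SURVIVING_DRIVES = 5      # drives that must be read fully during a RAID5 rebuild

    def rebuild_ure_probability(drive_tb, drives, ure_rate):
        """Probability of hitting at least one URE while reading `drives` full disks."""
        bits_read = drive_tb * 1e12 * 8 * drives
        return 1.0 - math.exp(-ure_rate * bits_read)

    if __name__ == "__main__":
        p = rebuild_ure_probability(DRIVE_TB, SURVIVING_DRIVES, URE_RATE)
        print(f"Chance of at least one URE during a RAID5 rebuild: {p:.0%}")
        # With RAID6, a single URE hit during a one-disk rebuild can still be
        # recovered from the second parity, which is why the practical risk drops.

With those assumptions it comes out around a coin flip per rebuild, which is exactly why the second parity stripe matters so much on arrays this size.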