Today we hit a worst-case scenario and are open to any good ideas.
Here is our problem:
We are using several dedicated storage servers to host our virtual machines. Before I continue, here are the specs:
- Dedicated Server Machine
- Areca 1280ML RAID controller, Firmware 1.49
- 12x Samsung 1TB HDDs
We configured one RAID6 set with 10 disks containing a single logical volume. We have two hot spares in the system.
Today one HDD failed. That happens from time to time, so we replaced it. During the rebuild a second disk failed. Normally this is no fun. We stopped heavy I/O operations to ensure a stable RAID rebuild.
Sadly, the hot-spare disk failed while rebuilding, and the whole thing stopped.
Now we have the following situation:
- The controller says that the RAID set is rebuilding
- The controller says that the volume failed
It is a RAID6 system and two disks failed, so the data has to be intact, but we cannot bring the volume online again to access it.
While searching we found the following leads. I don't know whether they are good or bad:
Mirroring all the disks to a second set of drives, so we would be able to try different things without losing more than we already have.
Trying to rebuild the array in R-Studio. But we have no real experience with the software.
Pulling all drives, rebooting the system, going into the Areca controller BIOS, and reinserting the HDDs one by one. Some people say they brought the system back online this way, some say it had no effect at all, and some say it blew up the whole thing.
Using undocumented Areca commands like "rescue" or "LeVel2ReScUe".
Contacting a computer forensics service. But whoa... initial phone estimates exceeded €20,000. That's why we would kindly ask for help here. Maybe we are missing the obvious?
And yes, of course we have backups. But some systems would lose one week of data, which is why we'd like to get the system up and running again.
Any help, suggestions and questions are more than welcome.
I think option 1 is your best bet.
Take 12x new HDDs and 1x new RAID controller. Mirror the old disks 1:1 onto the new ones (dd if= of=) using any Linux box (see the dd sketch after these steps). Then build a new server using the 1x new RAID controller plus the 12x new HDDs.
Try to rebuild the array in the new server. Success? Great. Stop.
Rebuild failed? Mirror the old disks onto the new ones again and try option i+1.
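A minimal sketch of the mirroring step, assuming the old disk shows up as /dev/sdb and its new counterpart as /dev/sdc (placeholder names; verify them with lsblk or dmesg before touching anything):

    # Clone the old disk onto the new one 1:1; keep going on read errors
    # and zero-pad unreadable blocks so the on-disk offsets stay aligned.
    dd if=/dev/sdb of=/dev/sdc bs=1M conv=noerror,sync status=progress

Repeat for each of the 12 disks. If some of the old disks already have pending read errors, GNU ddrescue is a better fit than plain dd because it retries bad sectors and keeps a map of what it could not read (ddrescue /dev/sdb /dev/sdc sdb.map).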
Unfortunately this is a very common scenario. There was a good Google study on disk failures years ago, and it turns out that losing data with RAID can happen exactly while the array is rebuilding. This affects different RAID levels with different severity. Here is the RAID6 scenario:
Why is that?
Think about the following: take the first three blocks of a file, so you have the data blocks A1 + A2 + A3 and the two parity blocks Ap + Aq sitting on hdd1...hdd5.
RAID6 can reconstruct any two missing blocks per stripe, so two failed disks are survivable. But if a third disk in that stripe dies, for example during the rebuild, the data is not recoverable: you are left with only two of the five blocks, which is not enough to reconstruct the other three.
Now the same scenario with 10 disks might be laid out differently, but I guess it is handled the same way: the data is split into 8 blocks per stripe, the parity goes to the 2 other drives, and the 2 hot spares sit outside the set. Do you know the details of your RAID controller configuration?
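If the Areca command-line utility (cli64/cli32) is installed on the host, it can usually answer that question. The exact subcommand names below are from memory and may vary by CLI and firmware version, so treat them as a hedged starting point rather than gospel:

    # Show RAID sets, volume sets and physical disks as the controller sees them
    cli64 rsf info
    cli64 vsf info
    cli64 disk info

The same information should also be visible in the controller BIOS utility or the card's web management interface.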
I would start by recovering from the offsite backup (I guess you have some) so the services come back up, and then try to recover as much data as possible from the old disks: on a Unix box, dd the drives to image files and use them as loop devices, for example.
http://wiki.edseek.com/guide:mount_loopback
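A rough sketch of that approach, assuming the old drive shows up as /dev/sdb and there is enough scratch space under /mnt/space (both placeholders):

    # Image the raw disk into a file, carrying on past read errors.
    dd if=/dev/sdb of=/mnt/space/disk01.img bs=1M conv=noerror,sync

    # Attach the image read-only as a block device and peek at its contents.
    losetup --find --show --read-only /mnt/space/disk01.img   # prints e.g. /dev/loop0
    file -s /dev/loop0

Working from read-only images keeps the original disks untouched, so a failed experiment costs nothing but time and disk space.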
You need to know what sort of on-disk metadata the RAID controller uses, and if you are lucky it is supported by some tool like dmraid.
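To check whether the images contain any metadata format a Linux tool recognises, something along these lines can be tried; note that dmraid mainly understands firmware/fake-RAID formats, so there is a fair chance Areca's layout is not among them:

    # Scan for RAID metadata formats that dmraid knows about.
    dmraid -r

    # mdadm reports anything that looks like Linux-md metadata on the image.
    mdadm --examine /dev/loop0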
But even this does not mean you can recover the data at all: files are usually spread across many, many blocks, so without knowing the controller's layout the recovery is likely to fail to bring back any of your data.
More about RAID:
https://raid.wiki.kernel.org/index.php/RAID_setup