I recently set up a 3-drive 4 TB MDRAID 5 array to mirror and serve as an online backup of our server.
I am preparing for a future hardware (drive) failure and want to mitigate the risk of a recovery failing because of an unrecoverable read error (URE).
Typically I think of the process for rebuilding an array as (roughly sketched in commands below the list):
- Remove and replace failed drive.
- Rebuild the array.
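Something like this, I assume, where /dev/md0 is the array, /dev/sdb1 is the failed member, and /dev/sdd1 is the replacement (all placeholder names):

```
# Mark the failed member faulty (if md has not already) and pull it from the array
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
# ...physically swap the drive and partition it to match the others...
# Add the replacement; md begins rebuilding onto it automatically
mdadm --manage /dev/md0 --add /dev/sdd1
# Watch the rebuild progress
cat /proc/mdstat
```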
From my understanding, data on a degraded RAID 5 array is still accessible; but once the failed drive has been replaced and the array is rebuilding, if a URE is encountered on one of the surviving drives, the rebuild fails and the data on the array is immediately rendered unreadable and unrecoverable.
If my understanding is correct, it does not seem prudent to rebuild the array until all of the (readable) data has been duplicated.
This leaves me with a process of (again sketched after the list):
- Duplicate the data from the array.
- Remove and replace the failed drive.
- Rebuild the array.
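In command terms that just puts a copy step in front of the same replace-and-rebuild sequence above; rsync is only one way to do the copy, and the mount points are placeholders:

```
# Copy everything off the still-readable degraded array first
rsync -aHAX /mnt/array/ /mnt/backup/
# Then proceed exactly as above: --fail/--remove the dead member,
# swap the hardware, and --add the replacement to start the rebuild
```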
Is there another process that would mitigate rebuild failures (aside from a second drive failure during the rebuild)? Is it safe to rebuild the array without duplicating the data first? Are my assumptions wrong, e.g. does the rebuild fail on a URE while the data remains available in the degraded state?
You could prepare yourself for drive failure, and for nearly every other kind of trouble, by implementing the 3-2-1 backup plan; in my personal opinion, 3-2-1 should be in place in every business-critical environment.
Following the 3-2-1 rule will make life easier. It obviously costs money, but the outcome should be worth it.
You could learn more here: https://knowledgebase.starwindsoftware.com/explanation/the-3-2-1-backup-rule/
https://www.veeam.com/blog/the-3-2-1-0-rule-to-high-availability.html
I've come to realize that UREs are a bit more complex, and less well understood by most, as they relate to array failures.
The conclusion is that UREs can cause arrays to fail, but not as often as the math in those articles says. Still, RAID 5 remains far more failure-prone during a rebuild than the other redundant RAID levels.
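For reference, the scary math in those articles goes roughly like this. It assumes the quoted 1-per-10^14-bits URE spec and that UREs are independent and evenly spread, which real drives don't honor; that gap is exactly why observed rebuild failure rates are lower. Taking my array as three 4 TB drives:

```
# Naive "RAID 5 is dead" arithmetic: a rebuild of a 3 x 4 TB array
# reads the two surviving drives end to end, about 8 TB = 6.4e13 bits
awk 'BEGIN {
    bits = 2 * 4e12 * 8              # bits read during the rebuild
    p    = 1 - exp(-bits * 1e-14)    # P(at least one URE) at 1 per 1e14 bits
    printf "P(URE during rebuild) ~ %.0f%%\n", p * 100
}'
# Prints roughly 47% -- the headline number, not what happens in practice
```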
So, back to basics: what are we mitigating during a RAID 5 rebuild? We are trying to get parity back before a second drive fails. THAT'S IT! This is a by-any-means-necessary endeavor.
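One easy piece of "by any means necessary" is simply not letting the resync crawl along at the kernel's default floor. A minimal sketch, assuming the array is /dev/md0 and you can spare the I/O bandwidth:

```
# Watch rebuild progress and the estimated finish time
cat /proc/mdstat
# Raise the kernel's resync bandwidth limits (values are KB/s per device)
echo 100000 > /proc/sys/dev/raid/speed_limit_min
echo 500000 > /proc/sys/dev/raid/speed_limit_max
```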
This leads me to solidify my list: take a fresh backup of the degraded array, build a new array from scratch, and restore the data.
This assumes the array can be taken offline, which is not always the case. In the end, though, others have reached the same conclusion: building a new array from scratch and transferring the data back in one fell swoop is easier and faster than attempting a full rebuild on a large, multi-TB array.
Further, I suspect that sequentially reading the data off the degraded array and writing it elsewhere, effectively only once, would greatly lower the chance of a second drive failing before the data is duplicated, compared to a full, thrashing rebuild, although the chance is still there.
In the end, it's all about risk management, which varies with the specifics of each situation. In my particular case, I can usually find a 24-hour window to restore my array, so taking a fresh backup, rebuilding the array, and restoring from that fresh backup is the best approach for me.
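For completeness, the back-up/recreate/restore path I'm describing looks something like this in my case. Device names, filesystem, and paths are placeholders, the --create options would have to match the real layout, and recreating obviously destroys whatever is on the member drives:

```
# 1. Copy everything off the degraded (but still readable) array
rsync -aHAX /mnt/array/ /mnt/backup/
# 2. Tear the old array down once the copy has been verified
umount /mnt/array
mdadm --stop /dev/md0
# 3. Build a brand-new array on the full set of drives (including the replacement)
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sd[bcd]1
mkfs.ext4 /dev/md0
# 4. Restore from the fresh backup
mount /dev/md0 /mnt/array
rsync -aHAX /mnt/backup/ /mnt/array/
```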