I have inherited a critical situation as follows.
- 2 x 16-Disk RAID5 Storage systems (one holding master data, one holding backups)
- Backup system had no monitoring and two disks failed so all data is lost - not a huge issue
- Master system is showing 2 disks with media errors, one holding steady at around 30 and the other at about 2,000 but slowly growing (was 2,100 after a week or so)
There are longer-term plans to use better storage, add hot spares, put better monitoring in place, set up mirroring, backups and so on, but the immediate need is to protect the master data. It is crucial to the business, yet it is sitting on a RAID5 array with two disks showing errors.
We have basically boiled the options down to one of the following:
Option 1
- Swap out the disk with 2,000 media errors and let the RAID5 array rebuild
- Once complete, swap out the other disk with media errors
Main concern with this is that whilst the array is being rebuilt (24-48 hours?), there is zero redundancy in the system and any disk failures would mean loss of all data.
Option 2
- Leave the RAID5 array as is and copy the data onto a new storage array
Main concern with this is that it will take many times longer than the RAID rebuild: the filesystem holds many hundreds of millions of small files, so the copy could take close to a month to complete without impacting the site that is using the files.
Which approach would you take, and why? Are media errors of this level worrying? Is the rate of growth in the media errors worrying?
Yes, I'd worry, and given your situation I'd get another system in and make a backup ASAP as any attempt to rebuild can easily result in losing everything.
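The fun part of RAID 5 is that a drive currently showing as okay may still have a URE (unrecoverable read error) waiting on it, so even the disks you think are working may not be. A rebuild has to read every sector of every remaining disk, which is exactly when such an error turns up. Hence the risk of an error during the rebuild.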
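To put a rough number on that risk, here is a back-of-the-envelope sketch. The disk size (2 TB) and the URE rate (1 in 10^14 bits, a figure often quoted on spec sheets) are my assumptions, not values from your post, and the model treats read errors as independent, which real drives do not strictly obey. Treat it as a pessimistic illustration rather than a prediction.

```python
import math

def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit):
    """Probability of at least one unrecoverable read error while
    reading every bit of every surviving disk once (independent errors)."""
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# Assumed values: 16-disk RAID5 with one disk pulled leaves 15 to read;
# 2 TB per disk and a 1e-14 per-bit URE rate are illustrative guesses.
p = p_ure_during_rebuild(surviving_disks=15, disk_bytes=2e12, ure_per_bit=1e-14)
print(f"Chance of hitting a URE during the rebuild: {p:.0%}")
```

With those assumed inputs the result is around 90%. Plug in your real disk size and the rate from your drives' data sheet to see where you stand.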
Get a system in place to copy your data over and start backing up those files ASAP. Then worry about rebuilding the server.
Although personally, once the backup is in place and you know it's good, I'd move the server over entirely to something with RAID 10 or RAID 6 and start fresh.
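For the copy itself, the thing to watch on an array with media errors is that one unreadable file should not abort the whole run. In practice a tool such as rsync would be the usual choice. The sketch below only illustrates the idea in Python: keep going on read errors, log what failed, and skip files that already exist at the destination so the job can be restarted. The function name, the paths and the "same size means already copied" shortcut are all hypothetical simplifications of mine.

```python
import os
import shutil

def copy_tree_logging_errors(src_root, dst_root, error_log):
    """Copy src_root into dst_root, logging unreadable files instead of
    stopping. Returns (files_copied, files_failed)."""
    copied = failed = 0
    with open(error_log, "a", encoding="utf-8") as log:

        def on_walk_error(exc):
            # Directory listing failed; record it and carry on.
            log.write(f"{exc.filename}\t{exc}\n")

        for dirpath, _dirnames, filenames in os.walk(src_root, onerror=on_walk_error):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.normpath(os.path.join(dst_root, rel))
            os.makedirs(dst_dir, exist_ok=True)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_dir, name)
                try:
                    # Crude resume check (assumption): same size means done.
                    if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
                        continue
                    shutil.copy2(src, dst)  # copies data plus timestamps
                    copied += 1
                except OSError as exc:
                    failed += 1
                    log.write(f"{src}\t{exc}\n")
    return copied, failed
```

Usage would be along the lines of `copy_tree_logging_errors("/mnt/master", "/mnt/new_array", "copy_errors.log")`, where both mount points are placeholders. The error log then tells you which files need to be recovered some other way.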