I have a software raid1 array. Today I found one of the drives failed to sync, and I got this:
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
153597312 blocks [2/1] [U_]
So I did:
mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md0 --add /dev/sda1
The devices synced about 25% of the way, AFAIK, and then I ended up with mdadm claiming that the drive that was OK has now failed:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[1] sdb1[2](F)
153597312 blocks [2/1] [_U]
I am now scared to re-add the drive and cause data loss.
- What is going on?
- Is there a way to test the array data is ok?
- What should I do now?
Thanks, Guy
What I would do in this scenario is first create a new RAID-1 from two (or three) new disks to use during the recovery process. Your existing RAID is only about 150GB, and 1TB disks are not expensive, so two 1TB disks will be more than enough for this.
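As a rough sketch of that first step, the commands could look like the following. The device names /dev/sdc1 and /dev/sdd1, the array name /dev/md1, the ext4 filesystem and the mount point /mnt/rescue are all assumptions for illustration, not something taken from your setup; check them against lsblk before running anything. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set RUN=1 only after checking the device names against lsblk output.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Hypothetical names: partitions on the two NEW disks, the new array,
# and the directory where the new array will be mounted.
NEW_A=${NEW_A:-/dev/sdc1}
NEW_B=${NEW_B:-/dev/sdd1}
MD_NEW=${MD_NEW:-/dev/md1}
MOUNT_POINT=${MOUNT_POINT:-/mnt/rescue}

make_rescue_array() {
    # Build a two-disk mirror from the new partitions.
    run mdadm --create "$MD_NEW" --level=1 --raid-devices=2 "$NEW_A" "$NEW_B"
    # Put a filesystem on it (ext4 is an assumption) and mount it.
    run mkfs.ext4 "$MD_NEW"
    run mkdir -p "$MOUNT_POINT"
    run mount "$MD_NEW" "$MOUNT_POINT"
}

make_rescue_array
```

Nothing here touches /dev/sda1 or /dev/sdb1, which is the point: the failing disks should stay untouched until they have been copied.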
Once that is ready, recover as many sectors as possible from each of the faulty disks into files on your new RAID-1. This is the most critical stage of the recovery process. Any mistake before this stage is complete could make your problem worse than it currently is.
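One way to do this copy is with GNU ddrescue; that tool choice is my assumption, and other imaging tools would work too. The sketch below images each partition in two passes and keeps a map file per disk, so an interrupted run can be resumed and the unreadable areas are recorded. /mnt/rescue is a hypothetical mount point on the new array. As written it only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set RUN=1 only after double-checking source and destination paths.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Hypothetical mount point of the new, healthy array.
DEST=${DEST:-/mnt/rescue}

# Image one partition into $DEST, tracking progress in a map file.
image_partition() {
    src=$1
    name=$(basename "$src")
    # Pass 1: copy everything that reads cleanly, skipping the slow
    # scraping phase (-n). -d reads the source with direct disc access.
    run ddrescue -d -n "$src" "$DEST/$name.img" "$DEST/$name.map"
    # Pass 2: return to the bad areas and retry them up to three times.
    run ddrescue -d -r3 "$src" "$DEST/$name.img" "$DEST/$name.map"
}

image_partition /dev/sda1
image_partition /dev/sdb1
```

The source is only ever read, never written, and the two images are kept separate so that no decision about which copy to trust has to be made at this stage.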
Judging from the behavior you experienced, it is likely that neither disk has failed completely but that each of them has unreadable sectors.
With a bit of luck you will be able to get a copy of every sector from at least one of the two drives. Once you have gotten past this stage, you can set the problematic drives aside and work with the recovered data on the new drives with little risk of making the situation worse.
It is likely that the data on the two drives is a bit out of sync. Due to the failed recovery attempt, you can't be completely sure which one is more up to date. And even if you did know, you'd likely see some sectors where the more up to date version was lost, and you'd be forced to use the older version.
This leaves you with a bit of a puzzle to figure out exactly what can be recovered. But this part of the recovery process is not very risky if you know what you are doing.
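To get a first overview of that puzzle, it helps to list where the two copies actually disagree. The following is a minimal sketch, assuming both partitions have already been copied to image files (the names sda1.img and sdb1.img are hypothetical) and assuming 512-byte sectors. It only reads the images and prints the number of each sector that differs.

```shell
#!/bin/sh
# Print the number of every 512-byte sector that differs between two
# image files. cmp -l lists each differing byte with its 1-based
# position; awk maps that position to a 0-based sector number, and uniq
# collapses the many differing bytes of one sector into a single line.
differing_sectors() {
    cmp -l "$1" "$2" | awk '{ print int(($1 - 1) / 512) }' | uniq
}

# Example usage (hypothetical paths on the new array):
#   differing_sectors /mnt/rescue/sda1.img /mnt/rescue/sdb1.img > diff.txt
#   wc -l < diff.txt    # how many sectors disagree
```

A short list means the copies are nearly identical and only a few sectors need a decision. The list alone does not say which copy is right for a given sector; that still has to be judged from what each image contains at that offset and from which sectors could not be read from each disk.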