Synology ships a customized version of the md driver and mdadm toolset that adds a 'DriveError' flag to the rdev->flags structure in the kernel.
Net effect: if you are unfortunate enough to get an array failure (first drive) combined with an error on a second drive, the array gets into a state where it won't let you repair or reconstruct it, even though reads from the drive are working fine.
At this point, I'm not really worried about THIS array, since I've already pulled the content off and intend to reconstruct it. I'm more interested in having a resolution path for the future, since this is the second time I've been bitten by this, and I know I've seen others asking similar questions in forums.
Synology support has been less than helpful (and mostly non-responsive), and won't share any information AT ALL on dealing with the raidsets on the box.
Contents of /proc/mdstat:
ds1512-ent> cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdb5[1] sda5[5](S) sde5[4](E) sdd5[3] sdc5[2]
11702126592 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUE]
md1 : active raid1 sdb2[1] sdd2[3] sdc2[2] sde2[4] sda2[0]
2097088 blocks [5/5] [UUUUU]
md0 : active raid1 sdb1[1] sdd1[3] sdc1[2] sde1[4] sda1[0]
2490176 blocks [5/5] [UUUUU]
unused devices: <none>
Status from an mdadm --detail /dev/md2:
/dev/md2:
Version : 1.2
Creation Time : Tue Aug 7 18:51:30 2012
Raid Level : raid5
Array Size : 11702126592 (11160.02 GiB 11982.98 GB)
Used Dev Size : 2925531648 (2790.00 GiB 2995.74 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Update Time : Fri Jan 17 20:48:12 2014
State : clean, degraded
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 64K
Name : MyStorage:2
UUID : cbfdc4d8:3b78a6dd:49991e1a:2c2dc81f
Events : 427234
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 21 1 active sync /dev/sdb5
2 8 37 2 active sync /dev/sdc5
3 8 53 3 active sync /dev/sdd5
4 8 69 4 active sync /dev/sde5
5 8 5 - spare /dev/sda5
As you can see, /dev/sda5 has been re-added to the array (it was the drive that outright failed), but even though md sees the drive as a spare, it won't rebuild onto it. /dev/sde5 in this case is the problem drive with the (E) DriveError state.
I have tried stopping the md device, running forced reassembles, removing and re-adding sda5, etc. No change in behavior.
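For reference, those attempts look roughly like the following on a stock mdadm; this is a sketch using the device names from the output above, not a recipe to paste in:

```shell
# None of these changed the behavior on the Synology build of mdadm.
mdadm --stop /dev/md2                               # take the array offline
mdadm --assemble --force /dev/md2 /dev/sd[abcde]5   # forced reassembly from members
mdadm --manage /dev/md2 --remove /dev/sda5          # drop the stuck spare...
mdadm --manage /dev/md2 --add /dev/sda5             # ...and add it back
```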
I was able to completely recreate the array with the following command (with 'missing' as a placeholder for the failed slot; since the level, chunk size, and device order match the original, this should rewrite only the metadata and leave the on-disk data intact):
mdadm --stop /dev/md2
mdadm --verbose \
--create /dev/md2 --chunk=64 --level=5 \
--raid-devices=5 missing /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5
which brought the array back to this state:
md2 : active raid5 sde5[4] sdd5[3] sdc5[2] sdb5[1]
11702126592 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUU]
I then re-added /dev/sda5:
mdadm --manage /dev/md2 --add /dev/sda5
after which it started a rebuild:
md2 : active raid5 sda5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1]
11702126592 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUU]
[>....................] recovery = 0.1% (4569508/2925531648) finish=908.3min speed=53595K/sec
Note that the position of the "missing" placeholder matches the exact position of the missing slot.
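If you want a quick readout of rebuild progress, the percentage can be pulled out of /proc/mdstat. A small sketch (shown with a hard-coded sample of the recovery line above, so it runs anywhere; on the NAS you would read /proc/mdstat itself):

```shell
# Extract the recovery percentage from an mdstat recovery line.
# Sample line hard-coded here; on the NAS use: grep recovery /proc/mdstat
line='[>....................]  recovery =  0.1% (4569508/2925531648) finish=908.3min speed=53595K/sec'
pct=$(printf '%s\n' "$line" | grep -o '[0-9.]*%' | head -n1)
echo "recovery at $pct"
```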
Once this finishes, I think I'll probably pull the questionable drive and have it rebuild again.
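Pulling the questionable drive amounts to failing it out of the array and rebuilding onto the replacement. A sketch with stock mdadm, assuming the replacement comes up as /dev/sde5 again (device names are assumptions; check yours first):

```shell
mdadm --manage /dev/md2 --fail /dev/sde5     # mark the flaky drive as failed
mdadm --manage /dev/md2 --remove /dev/sde5   # remove it from the array
# ...physically swap the drive and partition it to match the others, then:
mdadm --manage /dev/md2 --add /dev/sde5      # kicks off another rebuild
```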
I am looking for suggestions for a "less scary" way to do this repair, or for anyone who has been through this with a Synology array and knows how to force it to rebuild other than taking the md device offline and recreating the array from scratch.
Just an addition to the solution that I found after I experienced the same issue. I followed dSebastien's blog post on how to re-create the array:
I found that that method of recreating the array worked better than the method above. However, after re-creating the array the volume was still not showing in the web interface, and none of my LUNs were showing; it was basically a new array with nothing configured.
I contacted Synology support, and they remoted in to fix the issue. Unfortunately, they remoted in whilst I was away from the console, but I did manage to capture the session and looked through what they did. Whilst trying to recover some of my data, the drive crashed again, and I was back in the same situation. I recreated the array as in dSebastien's blog and then worked through the captured Synology session to perform their update. After running the below commands, my array and LUNs appeared in the web interface, and I was able to work with them.
I have practically zero experience in Linux, but these were the commands I performed in my situation. Hope this can help someone else, but please use this at your own risk. It would be best to contact Synology support and get them to fix this for you, as your situation might be different from mine.
Another addition: I've hit a very similar issue with my one-disk / RAID level 0 device.
Synology support was very helpful and restored my device. Here's what happened, hope this helps others:
My disk had read errors on one particular block; the messages in the system log (dmesg) were:
A few seconds later I received the dreadful "Volume 1 has crashed" mail from my device.
-- Disclaimer: Be sure to replace the device names with yours and do not simply copy & paste these commands, as this might make things worse! --
After stopping smb I was able to re-mount the partition read-only and run e2fsck with a badblocks check (-c):
(One could also use e2fsck -C 0 -p -v -f -c /dev/md2 to run as unattended as possible, although that didn't work out in my case, because the errors had to be fixed manually, so I had to restart e2fsck. Conclusion: -p doesn't make much sense in the case of a disk error.)
Although e2fsck was able to fix the errors and smartctl also showed no further increase in Raw_Read_Error_Rate, the volume still would not mount read-write on the device. DSM still showed "volume crashed".
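The exact remount commands from my session aren't shown above, but the sequence described might look roughly like this sketch; the device name (/dev/md2) and mount point (/volume1) are assumptions for illustration:

```shell
# Hypothetical sketch of the remount-and-check step, not the exact session:
mount -o remount,ro /volume1       # re-mount the filesystem read-only
e2fsck -C 0 -v -f -c /dev/md2      # -f forces a full check, -c scans for bad blocks
```

Leaving out -p keeps e2fsck interactive, which in my case was necessary because the errors had to be fixed manually anyway.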
So I opened a ticket with support. It took quite a while to get things going first, but in the end they fixed it by rebuilding the RAID array with:
Be sure to check your device names (/dev/mdX and /dev/sdaX) before doing anything; cat /proc/mdstat will show the relevant information.
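The actual rebuild command from the support session isn't reproduced above. For illustration only, a forced reassembly with stock mdadm generally looks like this sketch; the member partition (/dev/sda3) is a guess for a single-disk device, so verify yours against /proc/mdstat first:

```shell
# Hypothetical sketch, NOT the command support ran:
mdadm --stop /dev/md2                          # stop the crashed array
mdadm --assemble --force /dev/md2 /dev/sda3    # force reassembly from the sole member
mount /dev/md2 /volume1                        # then try mounting the volume again
```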