This is similar to "3 drives fell out of Raid6 mdadm - rebuilding?", except that it is not due to a failing cable. Instead, the 3rd drive fell offline during the rebuild of another drive.
The drive failed with:
kernel: end_request: I/O error, dev sdc, sector 293732432
kernel: md/raid:md0: read error not correctable (sector 293734224 on sdc).
After rebooting, both these sectors and the sectors around them are fine. This leads me to believe the error is intermittent and that the device simply took too long to error-correct the sector and remap it.
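One way to verify this (the sector numbers in the kernel log are 512-byte units):

dd if=/dev/sdc of=/dev/null bs=512 skip=293732432 count=16   # read the reported sector and its neighbours
dd if=/dev/sdc of=/dev/null bs=512 skip=293734224 count=16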
I expect that no data was written to the RAID after it failed. Therefore I hope that if I can kick the last failing device back online, the RAID is fine and the XFS filesystem is OK, maybe with a few missing recent files.
Taking a backup of the disks in the RAID takes 24 hours, so I would prefer that the solution works the first time.
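The per-disk copies can be made with, for example, GNU ddrescue (paths are illustrative):

ddrescue /dev/sdc /mnt/backup/sdc.img /mnt/backup/sdc.map   # copy one member disk; repeat for the others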
I have therefore set up a test scenario:
# PRE is a prefix used for the test files, loop devices and md device
export PRE=3
# Create 5 files of 1 GB each to act as disks
parallel dd if=/dev/zero of=/tmp/raid${PRE}{} bs=1k count=1000k ::: 1 2 3 4 5
# Turn the files into loop devices
parallel mknod /dev/loop${PRE}{} b 7 ${PRE}{} \; losetup /dev/loop${PRE}{} /tmp/raid${PRE}{} ::: 1 2 3 4 5
# Build a 5-device RAID6 on the loop devices
mdadm --create /dev/md$PRE -c 4096 --level=6 --raid-devices=5 /dev/loop${PRE}[12345]
cat /proc/mdstat
# Make an XFS filesystem and mount it
mkfs.xfs -f /dev/md$PRE
mkdir -p /mnt/disk2
umount -l /mnt/disk2
mount /dev/md$PRE /mnt/disk2
# Keep writing files in the background while devices fail
seq 1000 | parallel -j1 mkdir -p /mnt/disk2/{}\;cp /bin/* /mnt/disk2/{}\;sleep 0.5 &
# Fail two devices - RAID6 survives that
mdadm --fail /dev/md$PRE /dev/loop${PRE}3 /dev/loop${PRE}4
cat /proc/mdstat
# Assume reboot so no process is using the dir
kill %1; sync &
kill %1; sync &
# Force fail one too many
mdadm --fail /dev/md$PRE /dev/loop${PRE}1
parallel --tag -k mdadm -E ::: /dev/loop${PRE}? | grep Upda
# loop 2,5 are newest. loop1 almost newest => force add loop1
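The Events counter in the same mdadm -E output tells the same story and can be compared the same way:

parallel --tag -k mdadm -E ::: /dev/loop${PRE}? | grep Events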
The next step is to add loop1 back - and this is where I am stuck.
After that, do an XFS consistency check.
When that works, check that the solution also works on real devices (such as 4 USB sticks).
The magic seems to be
mdadm -A --force
and then only giving the devices that are known good plus the last failing device. For the test scenario that would be the assemble command sketched below; this starts the RAID device.
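A sketch, assuming the state above where loop2 and loop5 are current and loop1 was the last to fail:

mdadm --stop /dev/md$PRE                            # stop the degraded array if it is still running
mdadm -A --force /dev/md$PRE /dev/loop${PRE}[125]   # known-good devices (loop2, loop5) + last failing device (loop1)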
xfs_check tells you to mount the disk to replay the log. At this point do not use the directory: in the test scenario I have at least once had xfs complain and crash. So instead unmount it again and then run the check on the unmounted device, as sketched below.
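Roughly, for the test scenario (the final check invocation is an assumption; the mount point is the one from the setup above):

mount /dev/md$PRE /mnt/disk2   # mounting replays the XFS log
umount /mnt/disk2              # unmount again without using the directory
xfs_check /dev/md$PRE          # consistency check on the unmounted device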
This took 20 minutes on a 50 TB filesystem. Oddly enough most of the time was CPU time and not waiting for disk I/O. It used in the order of 100 GB RAM.
Now the file system is usable again:
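In the test scenario that is the same mount as in the setup:

mount /dev/md$PRE /mnt/disk2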
Everything up to the last sync is OK. Only stuff written after the last sync is flaky.

Add some spares and do the rebuild, along the lines sketched below.
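For the test scenario that could look like the following sketch, which wipes the failed loop devices and adds them back so md rebuilds onto them (on the real array it would be fresh disks):

mdadm --zero-superblock /dev/loop${PRE}3 /dev/loop${PRE}4   # wipe the stale metadata on the failed devices
mdadm /dev/md$PRE --add /dev/loop${PRE}3 /dev/loop${PRE}4   # add them as spares; the rebuild starts
cat /proc/mdstat                                            # watch the recovery progress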
When the copying of the existing disks finishes tomorrow, I will test out the above. If it works, then the above is an answer. Otherwise a new copy of the original set will be started, and new ideas are welcome (but please test them on the test scenario).
==
Spares are now added and the rebuild has started. Every 1000th file was copied to a directory on the file system (along the lines of the sketch below), and this did not cause issues in the logs. So it seems the filesystem is OK. It remains to be seen whether the users miss any files.
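A sketch of how that can be done (the directory name is illustrative):

mkdir -p /mnt/disk2/rebuild-test
# Copy every 1000th file into a scratch dir on the same filesystem to exercise reads and writes
find /mnt/disk2 -path /mnt/disk2/rebuild-test -prune -o -type f -print |
  awk 'NR % 1000 == 0' | parallel cp {} /mnt/disk2/rebuild-test/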
==
No users have reported missing files so far, so it seems to work.