I have a Linux software RAID 10 setup consisting of 5 RAID 1s (two drives per mirror) and a RAID 0 striped across all 5 RAID 1 pairs. To test that none of the drives were going to fail quickly under load, I ran badblocks across the RAID 0 in destructive read/write mode.
Badblocks command: badblocks -b 4096 -c 98304 -p 0 -w -s /dev/md13
One of the drives failed, and instead of carrying on, badblocks hung. Running a sync command also hangs. I would assume this isn't standard behavior for a RAID 1 device: if one of the drives fails, the array the two drives make up should still be writable without a problem.
So I proceeded to force-fail the drive and try to remove it. I can mark the drive as faulty without any problem (although the I/O operations are still hung), but I cannot remove the device from the array entirely; it says it is busy. My assumption is that if I can kick it out of the array completely the I/O will continue, but that is just an assumption, and I do think I am dealing with a bug of sorts.
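For reference, the fail/remove attempt was along the lines of the following (a sketch of the standard mdadm commands, not an exact transcript):

    # mark the failed member as faulty -- this works
    mdadm --manage /dev/md8 --fail /dev/sdn1

    # try to pull it out of the array -- this is what reports "device busy"
    mdadm --manage /dev/md8 --remove /dev/sdn1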
What is going on here exactly? Am I in an unrecoverable spot due to a bug?
The system is running kernel 2.6.18, so it isn't exactly new, but given that software RAID has been around for so long, I would think issues like these would not happen.
Any insight is greatly appreciated.
mdadm --detail /dev/md13
/dev/md13:
        Version : 00.90.03
  Creation Time : Thu Jan 21 14:21:57 2010
     Raid Level : raid0
     Array Size : 2441919360 (2328.80 GiB 2500.53 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 13
    Persistence : Superblock is persistent

    Update Time : Thu Jan 21 14:21:57 2010
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : cfabfaee:06cf0cb2:22929c7b:7b037984
         Events : 0.3

    Number   Major   Minor   RaidDevice State
       0       9        7        0      active sync   /dev/md7
       1       9        8        1      active sync   /dev/md8
       2       9        9        2      active sync   /dev/md9
       3       9       10        3      active sync   /dev/md10
       4       9       11        4      active sync   /dev/md11
The output for the failing RAID 1:
/dev/md8:
        Version : 00.90.03
  Creation Time : Thu Jan 21 14:20:47 2010
     Raid Level : raid1
     Array Size : 488383936 (465.76 GiB 500.11 GB)
    Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 8
    Persistence : Superblock is persistent

    Update Time : Mon Jan 25 04:52:25 2010
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 2865aefa:ab6358d8:8f82caf4:1663e806
         Events : 0.11

    Number   Major   Minor   RaidDevice State
       0      65       17        0      active sync   /dev/sdr1
       1       8      209        1      faulty        /dev/sdn1
Sorry, maybe I haven't understood the problem well (a cat /proc/mdstat would be helpful), but as far as I can see you shot yourself in the foot by destroying your data on the RAID 0, and therefore on the underlying RAID 1 arrays as well. That is, if you want to test RAID reliability, you should mark a single drive (a disk) as failed, not destroy logical blocks that map onto all of the underlying RAID 1 disks. Let me know if I have understood the problem correctly.
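For example, a reliability test against a single member of one of the RAID 1s could look roughly like this (a sketch using the device names from your mdadm output):

    # simulate a failure on one member of the mirror
    mdadm --manage /dev/md8 --fail /dev/sdn1

    # confirm the array keeps running degraded
    cat /proc/mdstat

    # put the member back and let it resync
    mdadm --manage /dev/md8 --remove /dev/sdn1
    mdadm --manage /dev/md8 --re-add /dev/sdn1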
Maybe you need to ask the kernel to remove the faulty drive; that should release the hung RAID.
You can remove it with a script like http://bash.cyberciti.biz/diskadmin/rescan-linux-scsi-bus/
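If you'd rather do it by hand, something along these lines should work (a sketch, assuming the faulty member is /dev/sdn as in your output):

    # tell the SCSI layer to drop the dead disk entirely
    echo 1 > /sys/block/sdn/device/delete

    # with the device gone, mdadm should now let you pull it from the array
    # (the --remove argument may need adjusting if the /dev/sdn1 node has already disappeared)
    mdadm --manage /dev/md8 --remove /dev/sdn1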