I have a Linux software RAID 10 setup consisting of 5 RAID 1s (two drives per mirror) and a RAID 0 striped across all 5 RAID 1 pairs. To test that none of the drives were going to fail quickly under load, I ran badblocks across the RAID 0 in destructive read/write mode.
Badblocks command: badblocks -b 4096 -c 98304 -p 0 -w -s /dev/md13
One of the drives failed, and instead of carrying on, badblocks hung. Running a sync command also hangs. I would assume this isn't standard behavior for a RAID 1 device: if one of the drives fails, the array the two drives make up should still be writable without a problem.
So I proceeded to force-fail the drive and try to remove it. I can mark the drive as faulty without any problem (although the I/O operations are still hung), but I cannot remove the device from the array entirely; it says it is busy. My assumption is that if I can kick it out of the array completely the I/O will continue, but that is just an assumption, and I do think I am dealing with a bug of sorts.
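For reference, the fail/remove attempt was along the lines of the following (a sketch of the standard mdadm commands, not an exact transcript):

    # mark the failed member as faulty -- this works
    mdadm --manage /dev/md8 --fail /dev/sdn1

    # try to pull it out of the array -- this is what reports "device busy"
    mdadm --manage /dev/md8 --remove /dev/sdn1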
What is going on here exactly? Am I in an unrecoverable spot due to a bug?
The system is running kernel 2.6.18, so it isn't exactly new, but given that software RAID has been around for so long, I would think issues like these would not happen.
Any insight is greatly appreciated.
mdadm --detail /dev/md13
/dev/md13:
        Version : 00.90.03
  Creation Time : Thu Jan 21 14:21:57 2010
     Raid Level : raid0
     Array Size : 2441919360 (2328.80 GiB 2500.53 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 13
    Persistence : Superblock is persistent

    Update Time : Thu Jan 21 14:21:57 2010
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : cfabfaee:06cf0cb2:22929c7b:7b037984
         Events : 0.3

    Number   Major   Minor   RaidDevice State
       0       9        7        0      active sync   /dev/md7
       1       9        8        1      active sync   /dev/md8
       2       9        9        2      active sync   /dev/md9
       3       9       10        3      active sync   /dev/md10
       4       9       11        4      active sync   /dev/md11
The output for the failing RAID 1:
/dev/md8:
        Version : 00.90.03
  Creation Time : Thu Jan 21 14:20:47 2010
     Raid Level : raid1
     Array Size : 488383936 (465.76 GiB 500.11 GB)
    Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 8
    Persistence : Superblock is persistent

    Update Time : Mon Jan 25 04:52:25 2010
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 2865aefa:ab6358d8:8f82caf4:1663e806
         Events : 0.11

    Number   Major   Minor   RaidDevice State
       0      65       17        0      active sync   /dev/sdr1
       1       8      209        1      faulty        /dev/sdn1
Sorry, maybe I haven't understood the problem well (a cat /proc/mdstat would be helpful), but as far as I can see you shot yourself in the foot by destroying your data on the RAID 0, and therefore on the underlying RAID 1 arrays as well. That is, if you want to test RAID reliability, you should mark a single drive (a disk) as failed, not destroy logical blocks that map onto all of the underlying RAID 1 disks. Let me know if I have understood the problem correctly.
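For example, a reliability test against a single member of one of the RAID 1s could look roughly like this (a sketch using the device names from your mdadm output):

    # simulate a failure on one member of the mirror
    mdadm --manage /dev/md8 --fail /dev/sdn1

    # confirm the array keeps running degraded
    cat /proc/mdstat

    # put the member back and let it resync
    mdadm --manage /dev/md8 --remove /dev/sdn1
    mdadm --manage /dev/md8 --re-add /dev/sdn1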
Maybe you need to ask the kernel to remove the faulty drive; that should release the hung RAID.
You can remove it with a script like http://bash.cyberciti.biz/diskadmin/rescan-linux-scsi-bus/
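If you'd rather do it by hand, something along these lines should work (a sketch, assuming the faulty member is /dev/sdn as in your output):

    # tell the SCSI layer to drop the dead disk entirely
    echo 1 > /sys/block/sdn/device/delete

    # with the device gone, mdadm should now let you pull it from the array
    # (the --remove argument may need adjusting if the /dev/sdn1 node has already disappeared)
    mdadm --manage /dev/md8 --remove /dev/sdn1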