The short version: I have a failed RAID 5 array which has a bunch of processes hung waiting on I/O operations on it; how can I recover from this?
The long version: Yesterday I noticed Samba access was very sporadic; accessing the server's shares from Windows would randomly lock up Explorer completely after clicking on one or two directories. I assumed it was Windows being a pain and left it. Today the problem is the same, so I did a little digging; the first thing I noticed was that running ps aux | grep smbd
gives a lot of lines like this:
ben 969 0.0 0.2 96088 4128 ? D 18:21 0:00 smbd -F
root 1708 0.0 0.2 93468 4748 ? Ss 18:44 0:00 smbd -F
root 1711 0.0 0.0 93468 1364 ? S 18:44 0:00 smbd -F
ben 3148 0.0 0.2 96052 4160 ? D Mar07 0:00 smbd -F
...
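(Aside: a tighter way to find these than grepping for " D" — which can also match text elsewhere in the line — is to test the STAT column directly; a sketch using standard ps/awk:)

```shell
# List processes whose state starts with D (uninterruptible sleep);
# $8 is the STAT column in "ps aux" output.
ps aux | awk '$8 ~ /^D/'
```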
There are a lot of processes stuck in the "D" (uninterruptible sleep) state. Running ps aux | grep " D"
shows up some other processes, including my nightly backup script, all of which need to access the volume mounted on my RAID array at some point. After some googling, I found that it might be down to the RAID array failing, so I checked /proc/mdstat, which shows this:
ben@jack:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[3](F) sdc1[1] sdd1[2]
2930271872 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
unused devices: <none>
And running mdadm --detail /dev/md0
gives this:
ben@jack:~$ sudo mdadm --detail /dev/md0
/dev/md0:
Version : 00.90
Creation Time : Sat Oct 31 20:53:10 2009
Raid Level : raid5
Array Size : 2930271872 (2794.53 GiB 3000.60 GB)
Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Mar 7 03:06:35 2011
State : active, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : f114711a:c770de54:c8276759:b34deaa0
Events : 0.208245
Number Major Minor RaidDevice State
3 8 17 0 faulty spare rebuilding /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
I believe this says that sdb1 has failed, and so the array is running with two drives out of three 'up'. Some advice I found said to check /var/log/messages for notices of failures, and sure enough there are plenty:
ben@jack:~$ grep sdb /var/log/messages
...
Mar 7 03:06:35 jack kernel: [4525155.384937] md/raid:md0: read error NOT corrected!! (sector 400644912 on sdb1).
Mar 7 03:06:35 jack kernel: [4525155.389686] md/raid:md0: read error not correctable (sector 400644920 on sdb1).
Mar 7 03:06:35 jack kernel: [4525155.389686] md/raid:md0: read error not correctable (sector 400644928 on sdb1).
Mar 7 03:06:35 jack kernel: [4525155.389688] md/raid:md0: read error not correctable (sector 400644936 on sdb1).
Mar 7 03:06:56 jack kernel: [4525176.231603] sd 0:0:1:0: [sdb] Unhandled sense code
Mar 7 03:06:56 jack kernel: [4525176.231605] sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 7 03:06:56 jack kernel: [4525176.231608] sd 0:0:1:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Mar 7 03:06:56 jack kernel: [4525176.231623] sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Mar 7 03:06:56 jack kernel: [4525176.231627] sd 0:0:1:0: [sdb] CDB: Read(10): 28 00 17 e1 5f bf 00 01 00 00
To me it is clear that device sdb has failed: I need to stop the array, shut down, replace the drive, reboot, then repair the array, bring it back up and mount the filesystem. I cannot hot-swap a replacement drive in, and I don't want to leave the array running in a degraded state. I believe I am supposed to unmount the filesystem before stopping the array, but the unmount is failing, and that is where I'm stuck now:
ben@jack:~$ sudo umount /storage
umount: /storage: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
It is indeed busy; there are some 30 or 40 processes waiting on I/O.
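(For reference, the processes holding the mount can be listed with the tools umount's error message suggests — though note that these may themselves block if the filesystem underneath is wedged:)

```shell
# Show which processes have files open under the mount point:
fuser -vm /storage
# or, equivalently, with lsof:
lsof /storage
```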
What should I do? Should I kill all these processes and try again? Is that a wise move when they are 'uninterruptible'? What would happen if I tried to reboot?
Please let me know what you think I should do. And please ask if you need any extra information to diagnose the problem or to help!
I don't think you need to stop the array. Simply fail /dev/sdb1, remove it from the array (I suppose the drive is physically pluggable), and plug in a new drive that you'll declare as a hot spare.
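In command form, that would look something like this (a sketch based on the device names in the question; run as root, and double-check the target devices before copying partition tables):

```shell
# Mark the failing member faulty (mdstat shows it already is) and
# remove it from the array:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# After physically swapping the drive, give it a matching partition
# table (assumes the new drive is the same size as a healthy member):
sfdisk -d /dev/sdc | sfdisk /dev/sdb

# Add the new partition; the array starts rebuilding onto it:
mdadm --manage /dev/md0 --add /dev/sdb1

# Watch the rebuild progress:
cat /proc/mdstat
```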
You can't kill a process that is blocked in uninterruptible I/O. What you'll have to do is use the lazy option of the umount command to remove the filesystem from the filesystem namespace even though files on it are still open. For more information on this (and other "quirks" of this aspect of Linux's design), see Neil Brown.
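For reference, the lazy unmount is just (mount point taken from the question):

```shell
# Detach /storage from the filesystem namespace immediately; the
# kernel completes the unmount once the last open file is released:
umount -l /storage
```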
You could also stop the Samba daemon, which would stop new writes to the disk and allow the in-flight writes to finish, rather than lazily unmounting a filesystem that is still being written to.
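On an Ubuntu of that era, that would be something like the following (the service name is an assumption and varies by distro and version):

```shell
# Stop the Samba daemons so no new I/O is issued against the array:
service smbd stop
# or, on older layouts:
/etc/init.d/samba stop
```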