Tonight I received a message generated by mdadm on my server:
This is an automatically generated mail message from mdadm
A DegradedArray event had been detected on md device /dev/md3.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](F) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
I removed /dev/sdb3 from /dev/md3 and re-added it; it rebuilt for a while and then became a spare device, so now I have these stats:
cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](S) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
and
mdadm -D /dev/md3
/dev/md3:
Version : 0.90
Creation Time : Sat Jun 28 14:47:58 2008
Raid Level : raid1
Array Size : 10000384 (9.54 GiB 10.24 GB)
Used Dev Size : 10000384 (9.54 GiB 10.24 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Sun Sep 4 16:30:46 2011
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 1c32c34a:52d09232:fc218793:7801d094
Events : 0.7172118
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 3 1 active sync /dev/sda3
2 8 19 - spare /dev/sdb3
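The remove/re-add itself was done with the usual mdadm commands, something like:

mdadm /dev/md3 --remove /dev/sdb3   # drop the failed member from md3
mdadm /dev/md3 --add /dev/sdb3      # add it back; md then started the recovery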
Here are the last logs from /var/log/messages:
Sep 4 16:15:45 ogw2 kernel: [1314646.950806] md: unbind<sdb3>
Sep 4 16:15:45 ogw2 kernel: [1314646.950820] md: export_rdev(sdb3)
Sep 4 16:17:00 ogw2 kernel: [1314721.977950] md: bind<sdb3>
Sep 4 16:17:00 ogw2 kernel: [1314722.011058] RAID1 conf printout:
Sep 4 16:17:00 ogw2 kernel: [1314722.011064] --- wd:1 rd:2
Sep 4 16:17:00 ogw2 kernel: [1314722.011070] disk 0, wo:1, o:1, dev:sdb3
Sep 4 16:17:00 ogw2 kernel: [1314722.011073] disk 1, wo:0, o:1, dev:sda3
Sep 4 16:17:00 ogw2 kernel: [1314722.012667] md: recovery of RAID array md3
Sep 4 16:17:00 ogw2 kernel: [1314722.012673] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Sep 4 16:17:00 ogw2 kernel: [1314722.012677] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Sep 4 16:17:00 ogw2 kernel: [1314722.012684] md: using 128k window, over a total of 10000384 blocks.
Sep 4 16:20:25 ogw2 kernel: [1314927.480582] md: md3: recovery done.
Sep 4 16:20:27 ogw2 kernel: [1314929.252395] ata2.00: configured for UDMA/133
Sep 4 16:20:27 ogw2 kernel: [1314929.260419] ata2.01: configured for UDMA/133
Sep 4 16:20:27 ogw2 kernel: [1314929.260437] ata2: EH complete
Sep 4 16:20:29 ogw2 kernel: [1314931.068402] ata2.00: configured for UDMA/133
Sep 4 16:20:29 ogw2 kernel: [1314931.076418] ata2.01: configured for UDMA/133
Sep 4 16:20:29 ogw2 kernel: [1314931.076436] ata2: EH complete
Sep 4 16:20:30 ogw2 kernel: [1314932.884390] ata2.00: configured for UDMA/133
Sep 4 16:20:30 ogw2 kernel: [1314932.892419] ata2.01: configured for UDMA/133
Sep 4 16:20:30 ogw2 kernel: [1314932.892436] ata2: EH complete
Sep 4 16:20:32 ogw2 kernel: [1314934.828390] ata2.00: configured for UDMA/133
Sep 4 16:20:32 ogw2 kernel: [1314934.836397] ata2.01: configured for UDMA/133
Sep 4 16:20:32 ogw2 kernel: [1314934.836413] ata2: EH complete
Sep 4 16:20:34 ogw2 kernel: [1314936.776392] ata2.00: configured for UDMA/133
Sep 4 16:20:34 ogw2 kernel: [1314936.784403] ata2.01: configured for UDMA/133
Sep 4 16:20:34 ogw2 kernel: [1314936.784419] ata2: EH complete
Sep 4 16:20:36 ogw2 kernel: [1314938.760392] ata2.00: configured for UDMA/133
Sep 4 16:20:36 ogw2 kernel: [1314938.768395] ata2.01: configured for UDMA/133
Sep 4 16:20:36 ogw2 kernel: [1314938.768422] sd 1:0:0:0: [sda] Unhandled sense code
Sep 4 16:20:36 ogw2 kernel: [1314938.768426] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 4 16:20:36 ogw2 kernel: [1314938.768431] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Sep 4 16:20:36 ogw2 kernel: [1314938.768438] Descriptor sense data with sense descriptors (in hex):
Sep 4 16:20:36 ogw2 kernel: [1314938.768441] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 4 16:20:36 ogw2 kernel: [1314938.768454] 01 ac b6 4a
Sep 4 16:20:36 ogw2 kernel: [1314938.768459] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Sep 4 16:20:36 ogw2 kernel: [1314938.768468] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b5 f8 00 03 80 00
Sep 4 16:20:36 ogw2 kernel: [1314938.768527] ata2: EH complete
Sep 4 16:20:38 ogw2 kernel: [1314940.788406] ata2.00: configured for UDMA/133
Sep 4 16:20:38 ogw2 kernel: [1314940.796394] ata2.01: configured for UDMA/133
Sep 4 16:20:38 ogw2 kernel: [1314940.796415] ata2: EH complete
Sep 4 16:20:40 ogw2 kernel: [1314942.728391] ata2.00: configured for UDMA/133
Sep 4 16:20:40 ogw2 kernel: [1314942.736395] ata2.01: configured for UDMA/133
Sep 4 16:20:40 ogw2 kernel: [1314942.736413] ata2: EH complete
Sep 4 16:20:42 ogw2 kernel: [1314944.548391] ata2.00: configured for UDMA/133
Sep 4 16:20:42 ogw2 kernel: [1314944.556393] ata2.01: configured for UDMA/133
Sep 4 16:20:42 ogw2 kernel: [1314944.556414] ata2: EH complete
Sep 4 16:20:44 ogw2 kernel: [1314946.372392] ata2.00: configured for UDMA/133
Sep 4 16:20:44 ogw2 kernel: [1314946.380392] ata2.01: configured for UDMA/133
Sep 4 16:20:44 ogw2 kernel: [1314946.380411] ata2: EH complete
Sep 4 16:20:46 ogw2 kernel: [1314948.196391] ata2.00: configured for UDMA/133
Sep 4 16:20:46 ogw2 kernel: [1314948.204391] ata2.01: configured for UDMA/133
Sep 4 16:20:46 ogw2 kernel: [1314948.204411] ata2: EH complete
Sep 4 16:20:48 ogw2 kernel: [1314950.144390] ata2.00: configured for UDMA/133
Sep 4 16:20:48 ogw2 kernel: [1314950.152392] ata2.01: configured for UDMA/133
Sep 4 16:20:48 ogw2 kernel: [1314950.152416] sd 1:0:0:0: [sda] Unhandled sense code
Sep 4 16:20:48 ogw2 kernel: [1314950.152419] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 4 16:20:48 ogw2 kernel: [1314950.152424] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Sep 4 16:20:48 ogw2 kernel: [1314950.152431] Descriptor sense data with sense descriptors (in hex):
Sep 4 16:20:48 ogw2 kernel: [1314950.152434] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 4 16:20:48 ogw2 kernel: [1314950.152447] 01 ac b6 4a
Sep 4 16:20:48 ogw2 kernel: [1314950.152452] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Sep 4 16:20:48 ogw2 kernel: [1314950.152461] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b6 48 00 00 08 00
Sep 4 16:20:48 ogw2 kernel: [1314950.152523] ata2: EH complete
Sep 4 16:20:48 ogw2 kernel: [1314950.575325] RAID1 conf printout:
Sep 4 16:20:48 ogw2 kernel: [1314950.575332] --- wd:1 rd:2
Sep 4 16:20:48 ogw2 kernel: [1314950.575337] disk 0, wo:1, o:1, dev:sdb3
Sep 4 16:20:48 ogw2 kernel: [1314950.575341] disk 1, wo:0, o:1, dev:sda3
Sep 4 16:20:48 ogw2 kernel: [1314950.575344] RAID1 conf printout:
Sep 4 16:20:48 ogw2 kernel: [1314950.575347] --- wd:1 rd:2
Sep 4 16:20:48 ogw2 kernel: [1314950.575350] disk 1, wo:0, o:1, dev:sda3
So I can't understand why this device (sdb3) became a SPARE and the RAID isn't synced...
Can anybody point out what I should do?
UPDATE: I forgot to say that /dev/md3 is mounted as the / (root) partition and includes all system directories except /boot.
Looks like MD kept the wrong device. sda is going bad, and threw an unrecoverable read error when reading blocks off of it to resync sdb.
Will the data have changed on sda after sdb was removed? If not, then you may be in luck - the filesystem on sdb might still be in a consistent state even after the failed resync; get MD to assemble the array with sdb instead.
That's a bit of a long shot, though; more likely, you'll be getting a good chance to see how well your backup strategy works.
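If you want to attempt that, it would look roughly like this (an untested sketch; since md3 is your root filesystem it has to be done from a rescue/live environment, and the device names are assumed to be the same there):

mdadm --stop /dev/md3                               # make sure the array is not running
mdadm --assemble --force --run /dev/md3 /dev/sdb3   # force-assemble the array degraded, from the sdb member only
mdadm --detail /dev/md3                             # check what you ended up with before mounting anything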
Note that ALL your MD arrays are in jeopardy --- not just the one that is "officially" degraded --- since they are all based on just two physical devices: sda and sdb. I sure hope you do have proper backups and/or system-recovery procedures in place, just in case things go REALLY pear-shaped. As Shane Madden noted, the log of the resync shows a worrying error that may be indicating that sda is less than healthy itself.
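If you want to see just how sick sda really is, its SMART data is a quick check (this assumes the smartmontools package is installed; it is not something the existing logs give you directly):

smartctl -a /dev/sda            # health status plus attributes; watch Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -t short /dev/sda      # optionally run a short self-test...
smartctl -l selftest /dev/sda   # ...and read the result a few minutes later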
The best thing to do is to pull sdb and replace it immediately. If you don't have a replacement handy, then order one ASAP (and perhaps use the intervening time to take one last full backup of all your arrays while they're still good!). Your replacement drive will need to be partitioned appropriately, and then the partitions added correspondingly to each of your four arrays. Hopefully, all will go well and all arrays will resync successfully.
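A rough sketch of that partition-and-re-add step, assuming the replacement drive shows up as /dev/sdb again (double-check the device names before running anything):

sfdisk -d /dev/sda | sfdisk /dev/sdb   # copy the partition table from the surviving drive to the new one
mdadm /dev/md1 --add /dev/sdb1         # add each new partition back into its array
mdadm /dev/md2 --add /dev/sdb2
mdadm /dev/md3 --add /dev/sdb3
mdadm /dev/md4 --add /dev/sdb4
cat /proc/mdstat                       # watch the resyncs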
However, if Shane is correct, and further errors from a failing sda prevent proper reassembly/resync, then the next thing to try will be to pull sda, replace it with your old sdb (which may still be good), and see if the combination of your old sdb and your new replacement drive reassembles and resyncs successfully.

And finally, if none of the above works, the very last thing to try (before a complete system rebuild and restore) is to replace the drive controller(s). I have seen drive controllers flake out and cause problems for otherwise healthy arrays. One way to test whether a controller might be the cause of MD errors is to put one of your "failed" drives into another Linux machine with a known-good controller and the mdadm tools installed. Since all your arrays are RAID1, the arrays on any single drive should be able to be assembled to a usable state (albeit degraded), where you can then check filesystems, take backups, and so on.
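For example, if the pulled drive shows up as /dev/sdc in that machine (a device name assumed for this sketch), its copy of the root array can be brought up degraded and inspected read-only:

mdadm --examine /dev/sdc3                           # confirm the superblock looks sane
mdadm --assemble --force --run /dev/md9 /dev/sdc3   # assemble a one-disk, degraded RAID1 (md9 is just a free md device)
fsck -n /dev/md9                                    # read-only filesystem check
mount -o ro /dev/md9 /mnt                           # mount read-only to inspect or copy data off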