Tonight I received a message generated by mdadm on my server:
This is an automatically generated mail message from mdadm
A DegradedArray event had been detected on md device /dev/md3.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](F) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
I removed /dev/sdb3 from /dev/md3 and re-added it; it rebuilt for a while and then became a spare device, so now I have these stats:
cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](S) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
and
mdadm -D /dev/md3
/dev/md3:
Version : 0.90
Creation Time : Sat Jun 28 14:47:58 2008
Raid Level : raid1
Array Size : 10000384 (9.54 GiB 10.24 GB)
Used Dev Size : 10000384 (9.54 GiB 10.24 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Sun Sep 4 16:30:46 2011
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 1c32c34a:52d09232:fc218793:7801d094
Events : 0.7172118
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 3 1 active sync /dev/sda3
2 8 19 - spare /dev/sdb3
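The remove/re-add itself was done with the usual mdadm commands, something like:

mdadm /dev/md3 --remove /dev/sdb3   # drop the failed member from md3
mdadm /dev/md3 --add /dev/sdb3      # add it back; md then started the recovery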
Here are the last logs from /var/log/messages:
Sep 4 16:15:45 ogw2 kernel: [1314646.950806] md: unbind<sdb3>
Sep 4 16:15:45 ogw2 kernel: [1314646.950820] md: export_rdev(sdb3)
Sep 4 16:17:00 ogw2 kernel: [1314721.977950] md: bind<sdb3>
Sep 4 16:17:00 ogw2 kernel: [1314722.011058] RAID1 conf printout:
Sep 4 16:17:00 ogw2 kernel: [1314722.011064] --- wd:1 rd:2
Sep 4 16:17:00 ogw2 kernel: [1314722.011070] disk 0, wo:1, o:1, dev:sdb3
Sep 4 16:17:00 ogw2 kernel: [1314722.011073] disk 1, wo:0, o:1, dev:sda3
Sep 4 16:17:00 ogw2 kernel: [1314722.012667] md: recovery of RAID array md3
Sep 4 16:17:00 ogw2 kernel: [1314722.012673] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Sep 4 16:17:00 ogw2 kernel: [1314722.012677] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Sep 4 16:17:00 ogw2 kernel: [1314722.012684] md: using 128k window, over a total of 10000384 blocks.
Sep 4 16:20:25 ogw2 kernel: [1314927.480582] md: md3: recovery done.
Sep 4 16:20:27 ogw2 kernel: [1314929.252395] ata2.00: configured for UDMA/133
Sep 4 16:20:27 ogw2 kernel: [1314929.260419] ata2.01: configured for UDMA/133
Sep 4 16:20:27 ogw2 kernel: [1314929.260437] ata2: EH complete
Sep 4 16:20:29 ogw2 kernel: [1314931.068402] ata2.00: configured for UDMA/133
Sep 4 16:20:29 ogw2 kernel: [1314931.076418] ata2.01: configured for UDMA/133
Sep 4 16:20:29 ogw2 kernel: [1314931.076436] ata2: EH complete
Sep 4 16:20:30 ogw2 kernel: [1314932.884390] ata2.00: configured for UDMA/133
Sep 4 16:20:30 ogw2 kernel: [1314932.892419] ata2.01: configured for UDMA/133
Sep 4 16:20:30 ogw2 kernel: [1314932.892436] ata2: EH complete
Sep 4 16:20:32 ogw2 kernel: [1314934.828390] ata2.00: configured for UDMA/133
Sep 4 16:20:32 ogw2 kernel: [1314934.836397] ata2.01: configured for UDMA/133
Sep 4 16:20:32 ogw2 kernel: [1314934.836413] ata2: EH complete
Sep 4 16:20:34 ogw2 kernel: [1314936.776392] ata2.00: configured for UDMA/133
Sep 4 16:20:34 ogw2 kernel: [1314936.784403] ata2.01: configured for UDMA/133
Sep 4 16:20:34 ogw2 kernel: [1314936.784419] ata2: EH complete
Sep 4 16:20:36 ogw2 kernel: [1314938.760392] ata2.00: configured for UDMA/133
Sep 4 16:20:36 ogw2 kernel: [1314938.768395] ata2.01: configured for UDMA/133
Sep 4 16:20:36 ogw2 kernel: [1314938.768422] sd 1:0:0:0: [sda] Unhandled sense code
Sep 4 16:20:36 ogw2 kernel: [1314938.768426] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 4 16:20:36 ogw2 kernel: [1314938.768431] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Sep 4 16:20:36 ogw2 kernel: [1314938.768438] Descriptor sense data with sense descriptors (in hex):
Sep 4 16:20:36 ogw2 kernel: [1314938.768441] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 4 16:20:36 ogw2 kernel: [1314938.768454] 01 ac b6 4a
Sep 4 16:20:36 ogw2 kernel: [1314938.768459] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Sep 4 16:20:36 ogw2 kernel: [1314938.768468] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b5 f8 00 03 80 00
Sep 4 16:20:36 ogw2 kernel: [1314938.768527] ata2: EH complete
Sep 4 16:20:38 ogw2 kernel: [1314940.788406] ata2.00: configured for UDMA/133
Sep 4 16:20:38 ogw2 kernel: [1314940.796394] ata2.01: configured for UDMA/133
Sep 4 16:20:38 ogw2 kernel: [1314940.796415] ata2: EH complete
Sep 4 16:20:40 ogw2 kernel: [1314942.728391] ata2.00: configured for UDMA/133
Sep 4 16:20:40 ogw2 kernel: [1314942.736395] ata2.01: configured for UDMA/133
Sep 4 16:20:40 ogw2 kernel: [1314942.736413] ata2: EH complete
Sep 4 16:20:42 ogw2 kernel: [1314944.548391] ata2.00: configured for UDMA/133
Sep 4 16:20:42 ogw2 kernel: [1314944.556393] ata2.01: configured for UDMA/133
Sep 4 16:20:42 ogw2 kernel: [1314944.556414] ata2: EH complete
Sep 4 16:20:44 ogw2 kernel: [1314946.372392] ata2.00: configured for UDMA/133
Sep 4 16:20:44 ogw2 kernel: [1314946.380392] ata2.01: configured for UDMA/133
Sep 4 16:20:44 ogw2 kernel: [1314946.380411] ata2: EH complete
Sep 4 16:20:46 ogw2 kernel: [1314948.196391] ata2.00: configured for UDMA/133
Sep 4 16:20:46 ogw2 kernel: [1314948.204391] ata2.01: configured for UDMA/133
Sep 4 16:20:46 ogw2 kernel: [1314948.204411] ata2: EH complete
Sep 4 16:20:48 ogw2 kernel: [1314950.144390] ata2.00: configured for UDMA/133
Sep 4 16:20:48 ogw2 kernel: [1314950.152392] ata2.01: configured for UDMA/133
Sep 4 16:20:48 ogw2 kernel: [1314950.152416] sd 1:0:0:0: [sda] Unhandled sense code
Sep 4 16:20:48 ogw2 kernel: [1314950.152419] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 4 16:20:48 ogw2 kernel: [1314950.152424] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Sep 4 16:20:48 ogw2 kernel: [1314950.152431] Descriptor sense data with sense descriptors (in hex):
Sep 4 16:20:48 ogw2 kernel: [1314950.152434] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 4 16:20:48 ogw2 kernel: [1314950.152447] 01 ac b6 4a
Sep 4 16:20:48 ogw2 kernel: [1314950.152452] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Sep 4 16:20:48 ogw2 kernel: [1314950.152461] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b6 48 00 00 08 00
Sep 4 16:20:48 ogw2 kernel: [1314950.152523] ata2: EH complete
Sep 4 16:20:48 ogw2 kernel: [1314950.575325] RAID1 conf printout:
Sep 4 16:20:48 ogw2 kernel: [1314950.575332] --- wd:1 rd:2
Sep 4 16:20:48 ogw2 kernel: [1314950.575337] disk 0, wo:1, o:1, dev:sdb3
Sep 4 16:20:48 ogw2 kernel: [1314950.575341] disk 1, wo:0, o:1, dev:sda3
Sep 4 16:20:48 ogw2 kernel: [1314950.575344] RAID1 conf printout:
Sep 4 16:20:48 ogw2 kernel: [1314950.575347] --- wd:1 rd:2
Sep 4 16:20:48 ogw2 kernel: [1314950.575350] disk 1, wo:0, o:1, dev:sda3
So I can't understand why this device (sdb3) became a SPARE and the RAID isn't synced...
Can anybody point out what I should do?
UPDATE: I forgot to say that /dev/md3 is mounted as the / (root) partition and includes all system directories except /boot.
Looks like MD kept the wrong device. sda is going bad, and threw an unrecoverable read error when reading blocks off of it to resync sdb.
Will the data have changed on sda after sdb was removed? If not, then you may be in luck - the filesystem on sdb might still be in a consistent state even after the failed resync; get MD to assemble the array with sdb instead.
That's a bit of a long shot, though; more likely, you'll be getting a good chance to see how well your backup strategy works.
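If you want to attempt that, it would look roughly like this (an untested sketch; since md3 is your root filesystem it has to be done from a rescue/live environment, and the device names are assumed to be the same there):

mdadm --stop /dev/md3                               # make sure the array is not running
mdadm --assemble --force --run /dev/md3 /dev/sdb3   # force-assemble the array degraded, from the sdb member only
mdadm --detail /dev/md3                             # check what you ended up with before mounting anything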
Note that ALL your MD arrays are in jeopardy --- not just the one that is "officially" degraded --- since they are all based on just two physical devices: sda and sdb. I sure hope you do have proper backups and/or system-recovery procedures in place, just in case things go REALLY pear-shaped. As Shane Madden noted, the log of the resync shows a worrying error that may be indicating that sda is less than healthy itself.
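If you want to see just how sick sda really is, its SMART data is a quick check (this assumes the smartmontools package is installed; it is not something the existing logs give you directly):

smartctl -a /dev/sda            # health status plus attributes; watch Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -t short /dev/sda      # optionally run a short self-test...
smartctl -l selftest /dev/sda   # ...and read the result a few minutes later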
The best thing to do is to pull sdb and replace it immediately. If you don't have a replacement handy, then order one ASAP (and perhaps use the intervening time to take one last full backup of all your arrays while they're still good!). Your replacement drive will need to be partitioned appropriately, and then the partitions added correspondingly to each of your four arrays. Hopefully, all will go well and all arrays will resync successfully.
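A rough sketch of that partition-and-re-add step, assuming the replacement drive shows up as /dev/sdb again (double-check the device names before running anything):

sfdisk -d /dev/sda | sfdisk /dev/sdb   # copy the partition table from the surviving drive to the new one
mdadm /dev/md1 --add /dev/sdb1         # add each new partition back into its array
mdadm /dev/md2 --add /dev/sdb2
mdadm /dev/md3 --add /dev/sdb3
mdadm /dev/md4 --add /dev/sdb4
cat /proc/mdstat                       # watch the resyncs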
However, if Shane is correct, and further errors from a failing sda prevent proper reassembly/resync, then the next thing to try will be to pull sda, replace it with your old sdb (which may still be good), and see if the combination of your old sdb and your new replacement drive reassembles and resyncs successfully.

And finally, if none of the above works, the very last thing to try (before a complete system rebuild and restore) is to replace the drive controller(s). I have seen drive controllers flake out and cause problems for otherwise healthy arrays. One way to test whether a controller might be the cause of MD errors is to put one of your "failed" drives into another Linux machine with a known-good controller and the mdadm tools installed. Since all your arrays are RAID1, the arrays on any single drive should be able to be assembled to a usable state (albeit degraded), where you can then check filesystems, take backups, and so on.
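For example, if the pulled drive shows up as /dev/sdc in that machine (a device name assumed for this sketch), its copy of the root array can be brought up degraded and inspected read-only:

mdadm --examine /dev/sdc3                           # confirm the superblock looks sane
mdadm --assemble --force --run /dev/md9 /dev/sdc3   # assemble a one-disk, degraded RAID1 (md9 is just a free md device)
fsck -n /dev/md9                                    # read-only filesystem check
mount -o ro /dev/md9 /mnt                           # mount read-only to inspect or copy data off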