Running Fedora 32, with the drives connected to a 4-port eSATA enclosure. One of the drives is clearly failing, with this message in the logs:
smartd[1169]: Device: /dev/sdd [SAT], FAILED SMART self-check. BACK UP DATA NOW!
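The verdict can be re-checked on demand with smartctl, from the same smartmontools package that provides smartd:
smartctl -H /dev/sdd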
Here's the mdadm detail output:
mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Mar 13 16:46:35 2020
Raid Level : raid10
Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
Used Dev Size : 1465005464 (1397.14 GiB 1500.17 GB)
Raid Devices : 4
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Jun 8 17:33:23 2020
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 8K
Consistency Policy : resync
Name : ourserver:0 (local to host ourserver)
UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
Events : 898705
Number Major Minor RaidDevice State
0 8 1 0 active sync set-A /dev/sda1
- 0 0 1 removed
3 8 49 2 active sync set-A /dev/sdd1
- 0 0 3 removed
What I don't understand is: what happened to the other two drives that were part of the RAID10?
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.4T 0 disk
└─sda1 8:1 0 1.4T 0 part
└─md0 9:0 0 2.7T 0 raid10
sdb 8:16 0 1.4T 0 disk
└─sdb1 8:17 0 1.4T 0 part
sdc 8:32 0 1.8T 0 disk
└─sdc1 8:33 0 1.8T 0 part
sdd 8:48 0 1.4T 0 disk
└─sdd1 8:49 0 1.4T 0 part
└─md0 9:0 0 2.7T 0 raid10
and:
blkid
/dev/sda1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="7df3d233-060a-aac3-04eb-9f3a65a9119e" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="0001b5c0-01"
/dev/sdb1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="64e3cedc-90db-e299-d786-7d096896f28f" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="00ff416d-01"
/dev/sdc1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="6d0134e3-1358-acfd-9c86-2967aec370c2" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="7da9b00e-01"
/dev/sdd1: UUID="88b9fcb6-52d0-f235-849b-d9d6c079cfc8" UUID_SUB="b1dd6f8b-a8e4-efa7-72b7-f987e71edeb2" LABEL="ourserver:0" TYPE="linux_raid_member" PARTUUID="b3de33a7-b2ea-f24e-903f-bae80136d543"
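All four partitions still report the same array UUID, so the md superblocks can also be cross-checked directly with:
mdadm --examine --scan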
cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sda1[0] sdd1[3]
2930010928 blocks super 1.2 8K chunks 2 near-copies [4/2] [U_U_]
unused devices: <none>
Originally I used these two commands to build out the RAID10:
mdadm -E /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdg1
mdadm --grow /dev/md0 --level=10 --backup-file=/home/backup-md0 --raid-devices=4 --add /dev/sdb1 /dev/sdd1 /dev/sdg1
After a few reboots the /dev/sdX naming (where X is a drive letter) changed. For the moment I don't have an mdadm.conf file, and I ran mdadm --assemble --force /dev/md0 /dev/sd[abcd]1 to at least get the data back; that's why /dev/sdb and /dev/sdc no longer show the raid10 type and have no md0 under /dev/sdb1 and /dev/sdc1 in the lsblk output above. How can I get the other two drives, /dev/sdb and /dev/sdc, back into the RAID10 and then just fail /dev/sdd until I get a replacement? Or is there a better approach?
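Side note: I assume that once the array is healthy again I can pin the assembly so the /dev/sdX reshuffling stops mattering, with something along the lines of:
mdadm --detail --scan >> /etc/mdadm.conf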
You can see from fdisk -l that the two drives are partitioned to be part of the RAID10:
Disk /dev/sda: 1.37 TiB, 1500301910016 bytes, 2930277168 sectors
Disk model: ST31500341AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0001b5c0
Device Boot Start End Sectors Size Id Type
/dev/sda1 2048 2930277167 2930275120 1.4T fd Linux raid autodetect
Disk /dev/sdb: 1.37 TiB, 1500301910016 bytes, 2930277168 sectors
Disk model: ST31500341AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00ff416d
Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 2930277167 2930275120 1.4T fd Linux raid autodetect
Disk /dev/sdc: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DM001-1ER1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x7da9b00e
Device Boot Start End Sectors Size Id Type
/dev/sdc1 2048 3907029167 3907027120 1.8T fd Linux raid autodetect
Disk /dev/sdd: 1.37 TiB, 1500301910016 bytes, 2930277168 sectors
Disk model: ST31500341AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: DC9A2601-CFE8-4ADD-85CD-FCBEBFCD8FAF
Device Start End Sectors Size Type
/dev/sdd1 34 2930277134 2930277101 1.4T Linux RAID
And examining all four drives shows they are active:
mdadm --examine /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
Name : ourserver:0 (local to host ourserver)
Creation Time : Fri Mar 13 16:46:35 2020
Raid Level : raid10
Raid Devices : 4
Avail Dev Size : 2930010944 (1397.14 GiB 1500.17 GB)
Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
Data Offset : 264176 sectors
Super Offset : 8 sectors
Unused Space : before=264096 sectors, after=16 sectors
State : clean
Device UUID : 7df3d233:060aaac3:04eb9f3a:65a9119e
Update Time : Mon Jun 8 17:33:23 2020
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : 6ad0f3f7 - correct
Events : 898705
Layout : near=2
Chunk Size : 8K
Device Role : Active device 0
Array State : A.A. ('A' == active, '.' == missing, 'R' == replacing)
mdadm --examine /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
Name : ourserver:0 (local to host ourserver)
Creation Time : Fri Mar 13 16:46:35 2020
Raid Level : raid10
Raid Devices : 4
Avail Dev Size : 2930010944 (1397.14 GiB 1500.17 GB)
Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
Data Offset : 264176 sectors
Super Offset : 8 sectors
Unused Space : before=263896 sectors, after=16 sectors
State : clean
Device UUID : 64e3cedc:90dbe299:d7867d09:6896f28f
Update Time : Wed Mar 18 11:50:09 2020
Bad Block Log : 512 entries available at offset 264 sectors
Checksum : aa48b164 - correct
Events : 37929
Layout : near=2
Chunk Size : 8K
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
mdadm --examine /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
Name : ourserver:0 (local to host ourserver)
Creation Time : Fri Mar 13 16:46:35 2020
Raid Level : raid10
Raid Devices : 4
Avail Dev Size : 3906762944 (1862.89 GiB 2000.26 GB)
Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
Data Offset : 264176 sectors
Super Offset : 8 sectors
Unused Space : before=263896 sectors, after=976752016 sectors
State : active
Device UUID : 6d0134e3:1358acfd:9c862967:aec370c2
Update Time : Sun May 10 16:22:39 2020
Bad Block Log : 512 entries available at offset 264 sectors
Checksum : df218e12 - correct
Events : 97380
Layout : near=2
Chunk Size : 8K
Device Role : Active device 1
Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
mdadm --examine /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
Name : ourserver:0 (local to host ourserver)
Creation Time : Fri Mar 13 16:46:35 2020
Raid Level : raid10
Raid Devices : 4
Avail Dev Size : 2930012925 (1397.14 GiB 1500.17 GB)
Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
Used Dev Size : 2930010928 (1397.14 GiB 1500.17 GB)
Data Offset : 264176 sectors
Super Offset : 8 sectors
Unused Space : before=263896 sectors, after=1997 sectors
State : clean
Device UUID : b1dd6f8b:a8e4efa7:72b7f987:e71edeb2
Update Time : Mon Jun 8 17:33:23 2020
Bad Block Log : 512 entries available at offset 264 sectors
Checksum : 8da0376 - correct
Events : 898705
Layout : near=2
Chunk Size : 8K
Device Role : Active device 2
Array State : A.A. ('A' == active, '.' == missing, 'R' == replacing)
Can I try the --force and --assemble options as mentioned by this user, or can I try the --replace option mentioned here?
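For clarity, the command shapes I'm referring to are roughly (not run here, device names as they currently appear):
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[abcd]1
or, once a spare is available:
mdadm /dev/md0 --replace /dev/sdd1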
Edit: Now I'm seeing this after the resync:
mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Mar 13 16:46:35 2020
Raid Level : raid10
Array Size : 2930010928 (2794.28 GiB 3000.33 GB)
Used Dev Size : 1465005464 (1397.14 GiB 1500.17 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Tue Jun 9 15:51:31 2020
State : clean, degraded
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 8K
Consistency Policy : resync
Name : ourserver:0 (local to host ourserver)
UUID : 88b9fcb6:52d0f235:849bd9d6:c079cfc8
Events : 1083817
Number Major Minor RaidDevice State
0 8 81 0 active sync set-A /dev/sdf1
4 8 33 1 active sync set-B /dev/sdc1
3 8 17 2 active sync set-A /dev/sdb1
- 0 0 3 removed
5 8 1 - spare /dev/sda1
cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sda1[5](S)(R) sdf1[0] sdb1[3] sdc1[4]
2930010928 blocks super 1.2 8K chunks 2 near-copies [4/3] [UUU_]
unused devices: <none>
Now I'm seeing this in the logs:
Jun 9 15:51:31 ourserver kernel: md: recovery of RAID array md0
Jun 9 15:51:31 ourserver kernel: md/raid10:md0: insufficient working devices for recovery.
Jun 9 15:51:31 ourserver kernel: md: md0: recovery interrupted.
Jun 9 15:51:31 ourserver kernel: md: super_written gets error=10
Jun 9 15:53:23 ourserver kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
And trying to fail /dev/sdb results in:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set device faulty failed for /dev/sdb1: Device or resource busy
How do I promote the spare drive and fail /dev/sdb?
You are effectively running with no redundancy and with a soon-to-fail disk.
Before doing anything, take backups! If you have many files to back up, I recommend first taking a block-level copy of the failing disk via
ddrescue /dev/sdd </dev/anotherdisk>
where /dev/anotherdisk is an additional disk (even a USB one). After having both file-level and block-level backups, you can try to salvage the array by issuing the following command:
mdadm /dev/md0 --add /dev/sdb1 /dev/sdc1
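You can then follow the recovery progress with, for example:
watch cat /proc/mdstat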
However, please strongly consider completely recreating the array, as you are using a very small chunk size (8K) which will severely impair performance (a good default chunk size is 512K).
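As a rough sketch only (this destroys the current array contents, so it is something to do after restoring from backup; the member names below are placeholders for the four partitions, including the replacement for the failing disk):
mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=512 /dev/sd[wxyz]1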
UPDATE: I just noticed you further damaged the array with a forced assembly, setting sda as a spare. Moreover, an extraneous sdf appeared. By forcing the array assembly with such out-of-date disks, you probably lost any chance of recovering the array. I strongly advise you to contact a proper data recovery specialist.