This is a Mint 21.1 x64 Linux system which has, over the years, had disks added to RAID arrays until we now have one array of 10 x 3TB drives and one array of 5 x 6TB drives. Four HDs dropped out of the arrays, two from each, apparently as a result of one controller failing. We've replaced the controllers, but that has not restored the arrays to function. mdadm --assemble reports that it is unable to start either array because of insufficient disks (with two failed in each, I'm not surprised); mdadm --run reports an I/O error (syslog seems to suggest this is because it can't start all the drives, but there is no indication that it even tried to start the two apparently unhappy ones). Yet I can still run mdadm --examine on the failed disks and they look absolutely normal. Here's the output from a functional drive:
mdadm --examine /dev/sda
/dev/sda:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
Name : DataBackup:back (local to host DataBackup)
Creation Time : Mon Feb 15 13:43:15 2021
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=944 sectors
State : clean
Device UUID : 6e072616:2f7079b0:b336c1a7:f222c711
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Apr 2 04:30:27 2023
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 2faf0b93 - correct
Events : 21397
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 9
Array State : AAAAAA..AA ('A' == active, '.' == missing, 'R' == replacing)
And here's output from a failed drive:
mdadm --examine /dev/sdk
/dev/sdk:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
Name : DataBackup:back (local to host DataBackup)
Creation Time : Mon Feb 15 13:43:15 2021
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=944 sectors
State : clean
Device UUID : d62b85bc:fb108c56:4710850c:477c0c06
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Apr 2 04:27:31 2023
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : d53202fe - correct
Events : 21392
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 6
Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
Edit: Here's the --examine report from the second failed drive; as you can see, it failed at the same time the entire array went offline.
# mdadm --examine /dev/sdl
/dev/sdl:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
Name : DataBackup:back (local to host DataBackup)
Creation Time : Mon Feb 15 13:43:15 2021
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=944 sectors
State : clean
Device UUID : 35ebf7d9:55148a4a:e190671d:6db1c2cf
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Apr 2 04:27:31 2023
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : c13b7b79 - correct
Events : 21392
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 7
Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
The second array, 5 x 6TB, went offline two minutes later when two of its disks quit. The two failed disks on this array and the two on the other array were all connected to a single 4-port SATA controller card, which has of course now been replaced.
The main thing I find interesting about this is that the failed drive seems to report itself as alive, but mdadm doesn't agree with it. journalctl doesn't seem to go back as far as 2 April, so I may not be able to find out what happened. Does anyone have any ideas about what I can do to bring this beast back online?
Answer:
First of all, back up every member drive before running any further mdadm commands. With these backups at hand you can later attempt recovery on a VM outside the box.
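A minimal sketch of such a backup, assuming the drives are still visible as block devices and that /mnt/backup is a hypothetical mount point with enough free space; ddrescue is used here, but plain dd with conv=noerror,sync would also work:

# Image every member drive before touching the arrays any further.
# Device names and the /mnt/backup target are examples only.
for d in sda sdk sdl; do
    ddrescue -f -n "/dev/$d" "/mnt/backup/$d.img" "/mnt/backup/$d.map"
done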
Check the Update Time field for the failed drives in the output of mdadm --examine /dev/sdX to determine the exact sequence of events as the drives were falling out of the array. Sometimes the first drive failure goes unnoticed, and bringing that old drive back online will result in a catastrophic failure while trying to mount a filesystem.
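As a quick sketch, a loop like the one below collects those fields from every member at once (adjust the /dev/sd[a-l] glob to whatever your members actually are):

# Print role, update time, event count and array state for each member.
for d in /dev/sd[a-l]; do
    echo "== $d"
    mdadm --examine "$d" | grep -E 'Device Role|Update Time|Events|Array State'
done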
In your case both failed drives show the same Update Time (Sun Apr 2 04:27:31) and the same Events count (21392), so they dropped out of the array at the same moment, and it should be safe to force the whole array online with mdadm --assemble --force /dev/mdX or mdadm --assemble --force --scan. If that were not the case, you should force online only the last drive that fell off the array, by specifying the array member drives for mdadm --assemble --force /dev/mdX /dev/sda /dev/sdb missing /dev/sdd; note that the order of drives is important.
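A hedged sketch of that forced assembly for the 10-drive array, assuming it is /dev/md0 and that the member list is replaced with your real devices; mounting read-only first keeps the filesystem untouched until you trust the result:

mdadm --stop /dev/md0                            # clear any half-assembled instance
mdadm --assemble --force /dev/md0 /dev/sd[a-j]   # substitute your real member list
cat /proc/mdstat                                 # confirm the array started (degraded is OK)
mount -o ro /dev/md0 /mnt/recovery               # hypothetical mount point, read-only sanity check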
I believe your array is currently in a degraded state, with that /dev/sdh marked offline. Look into the output of cat /proc/mdstat to determine that, do a backup, troubleshoot your hardware, and rebuild your array completely after that.
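Once the hardware is trusted again, re-adding a replaced drive is roughly this (array and device names are only examples):

mdadm --detail /dev/md0                    # see which slots are missing or faulty
mdadm --manage /dev/md0 --add /dev/sdh     # add the replacement; RAID5 starts rebuilding onto it
watch cat /proc/mdstat                     # follow the resync/rebuild progress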