I have a ZFS pool in the following state:
[root@SERVER-abc ~]# zpool status -v DATAPOOL
  pool: DATAPOOL
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 18.5M in 00:00:01 with 0 errors on Wed Jan 5 19:10:50 2022
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0    17  too many errors
            spare-1                                       ONLINE       0     0    17
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    30
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    29

        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87@auto-2022-01-04_11-41:<0x1>
        <0x1080a>:<0x1>
        <0x182a>:<0x1>
        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87:<0x1>
        <0x16fa>:<0x1>
This is a zpool with 4 drives plus 1 spare. Something happened, and suddenly the spare paired up automatically with one of the other drives as spare-1.
This is unexpected to me:
- Why did the spare not replace the degraded drive?
- How can I find out why the spare jumped in as spare-1?
- Is it possible (or even recommended) to get the spare back and then replace the degraded drive? (See the command sketch after this list.)
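For the third question, my current understanding as a sketch, using the gptids from the output above (I would only do this once the pool is healthy again, and on TrueNAS the replace step is normally done through the GUI so that partitioning and gptid labels are handled automatically):

    # Option A: make the spare permanent by detaching the drive it jumped in
    # for; 1bfaa607 then becomes a regular raidz2 member and leaves the spares list
    zpool detach DATAPOOL gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e

    # Option B: return the spare to the spares list instead, keeping 168342c5 in place
    zpool detach DATAPOOL gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e

    # Afterwards, replace the degraded drive with a fresh disk
    # (/dev/da5 is only a placeholder for the new device)
    zpool replace DATAPOOL gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e /dev/da5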
The goal is to rescue the pool without having to pull tons of data from backup, but at the core I want to understand what happened and why, and how to deal with such situations according to best practices.
Thanks a bunch! :)
System is: SuperMicro, TrueNAS-12.0-U4.1, zfs-2.0.4-3
Edit: Changed output from zpool status -x to zpool status -v DATAPOOL
Edit2: As of now I understand that 168342c5 was the first to show errors and the spare (1bfaa607) jumped in for it. After that, 14c707c6 degraded as well.
Edit3, additional question: as all drives (except the one in spare-1) seem to have CKSUM errors - what does that indicate? Cabling? The HBA? All drives dying simultaneously?
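To narrow that down, a first step would be to map each gptid from the zpool output to its physical disk and check which controller ports those disks sit on. On TrueNAS Core (FreeBSD) roughly this should work (a sketch, untested on this box; the grep pattern is just to cut down the output):

    glabel status | grep gptid   # maps gptid/... labels to daN/adaN partitions
    camcontrol devlist           # lists disks per controller/bus/target

If the error-prone gptids turn out to share a cable, backplane lane, or HBA port, that would point away from the disks themselves.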
Latest update: after zpool clear and zpool scrub DATAPOOL it seems clear that a lot has happened and that there is no way to rescue the pool:
  pool: DATAPOOL
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 6 16:18:05 2022
        1.82T scanned at 1.55G/s, 204G issued at 174M/s, 7.82T total
        40.8G resilvered, 2.55% done, 12:44:33 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   156  too many errors
            spare-1                                       DEGRADED     0     0     0
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  DEGRADED     0     0   236  too many errors
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0  (resilvering)
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   182  too many errors
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   179  too many errors

        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use
I'll check all SMART stats now.
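Roughly like this (a sketch; da0 through da4 are placeholders for whatever glabel status reports, and the attribute names assume SATA drives):

    for d in da0 da1 da2 da3 da4; do
      echo "=== /dev/$d ==="
      smartctl -a /dev/$d | egrep -i 'serial number|reallocated|current_pending|offline_uncorrect|udma_crc'
    done

My working assumption: rising UDMA_CRC_Error_Count on several drives at once would point to cabling/backplane/HBA, while growing reallocated or pending sector counts would point at the drives themselves.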
Is this a 4-disk RAIDZ2? Did you choose that layout over ZFS mirrors?
Can you show the output of zpool status -v? Please also run a zpool clear and follow the results/progress.
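For reference, clearing the counters and following the progress could look like this (a sketch; the five-minute interval is arbitrary):

    zpool clear DATAPOOL        # resets the READ/WRITE/CKSUM counters
    zpool status -v DATAPOOL    # confirm the counters are back at 0
    # re-check periodically while the scrub/resilver runs
    while true; do zpool status -v DATAPOOL; sleep 300; done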