I have a ZFS pool in the following state:
[root@SERVER-abc ~]# zpool status -v DATAPOOL
pool: DATAPOOL
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: resilvered 18.5M in 00:00:01 with 0 errors on Wed Jan 5 19:10:50 2022
config:
NAME STATE READ WRITE CKSUM
DATAPOOL DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e DEGRADED 0 0 17 too many errors
spare-1 ONLINE 0 0 17
gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 0
gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 0
gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 30
gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 29
spares
gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e INUSE currently in use
errors: Permanent errors have been detected in the following files:
DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87@auto-2022-01-04_11-41:<0x1>
<0x1080a>:<0x1>
<0x182a>:<0x1>
DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87:<0x1>
<0x16fa>:<0x1>
This is a raidz2 pool with 4 drives plus 1 spare. Something happened, and the spare suddenly paired itself automatically with one of the drives as spare-1.
This is unexpected to me and raises some questions:
- Why did the spare not replace the degraded drive?
- How can I find out why the spare jumped to spare-1?
- Is it possible (or even recommended) to get the spare back and then replace the degraded drive?
The goal is to rescue the pool without having to restore tons of data from the backup, but at the core I want to understand what happened and why, and how to deal with such situations according to best practices.
Thanks a bunch! :)
System is: SuperMicro, TrueNAS-12.0-U4.1, zfs-2.0.4-3
Edit: Changed output from zpool status -x to zpool status -v DATAPOOL
Edit2: As of now I understand that 168342c5 seems to have had an error first and the spare (1bfaa607) jumped in. After that, 14c707c6 degraded as well.
Edit3, additional question: all drives (except the one in spare-1) seem to have CKSUM errors. What does that indicate? Cabling? The HBA? Are all drives dying simultaneously?
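To make the pattern behind Edit3 easier to see, here is a minimal Python sketch that tallies the CKSUM column per disk from the device rows of the first status output above. The names `STATUS_ROWS`, `ROW` and `cksum_by_device` are hypothetical helpers of mine, not part of any ZFS tooling, and the regex only handles the plain integer counters shown here (it assumes no abbreviated values such as `1.2K`).

```python
import re

# Device rows copied from the first `zpool status -v` output above.
STATUS_ROWS = """\
gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e DEGRADED 0 0 17 too many errors
spare-1 ONLINE 0 0 17
gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 0
gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 0
gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 30
gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 29
"""

# NAME STATE READ WRITE CKSUM; assumes plain integer counters.
ROW = re.compile(
    r"^(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def cksum_by_device(text):
    """Map the first 8 characters of each gptid to its CKSUM count."""
    counts = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        # Skip aggregate vdev rows such as spare-1; keep only leaf disks.
        if m and m.group(1).startswith("gptid/"):
            short_id = m.group(1).split("/")[1][:8]
            counts[short_id] = int(m.group(5))
    return counts

counts = cksum_by_device(STATUS_ROWS)
with_errors = [dev for dev, n in counts.items() if n > 0]
print(counts)
print(f"{len(with_errors)} of {len(counts)} disks report CKSUM errors")
```

In this first output, three of the five disks already show checksum errors, and the spare-1 vdev itself carries 17 even though both of its members show 0 at that point.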
Latest update: after zpool clear and zpool scrub DATAPOOL, it seems clear that a lot has happened and there is no way to rescue the pool:
pool: DATAPOOL
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Jan 6 16:18:05 2022
1.82T scanned at 1.55G/s, 204G issued at 174M/s, 7.82T total
40.8G resilvered, 2.55% done, 12:44:33 to go
config:
NAME STATE READ WRITE CKSUM
DATAPOOL DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e DEGRADED 0 0 156 too many errors
spare-1 DEGRADED 0 0 0
gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e DEGRADED 0 0 236 too many errors
gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e ONLINE 0 0 0 (resilvering)
gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e DEGRADED 0 0 182 too many errors
gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e DEGRADED 0 0 179 too many errors
spares
gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e INUSE currently in use
I'll check all SMART stats now.