I wanted to replace a disk in my zpool by issuing the following command:
zpool replace -o ashift=12 pool /dev/mapper/transport /dev/mapper/data2
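(To verify that the requested ashift was actually applied, the cached pool configuration can be dumped with zdb; a minimal sanity check, assuming the pool is imported and present in the cache file:

# zdb -C pool | grep ashift

which should report ashift: 12 for the affected top-level vdev.)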
ZFS got to work and resilvered the pool. In the process, there were some read errors on the old disk, and after it finished, zpool status -v
looked like this:
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 6,30T in 147h38m with 6929 errors on Sat Feb 11 13:31:05 2017
config:

        NAME             STATE     READ WRITE CKSUM
        pool             ONLINE       0     0 16,0K
          raidz1-0       ONLINE       0     0 32,0K
            data1        ONLINE       0     0     0
            replacing-1  ONLINE       0     0     0
              transport  ONLINE   14,5K     0     0
              data2      ONLINE       0     0     0
            data3        ONLINE       0     0     0
        logs
          data-slog      ONLINE       0     0     0
        cache
          data-cache     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <list of 3 files>
I expected the old disk to be detached from the pool, but it wasn't. I tried to detach it manually:
# zpool detach pool /dev/mapper/transport
cannot detach /dev/mapper/transport: no valid replicas
But when I exported the pool, physically removed the old drive, and imported the pool again (the command sequence is sketched below, after the status output), everything seemed to work fine: the pool started resilvering again, and its state is DEGRADED, not FAULTED:
  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Feb 11 17:28:50 2017
        42,7G scanned out of 9,94T at 104M/s, 27h43m to go
        1,68G resilvered, 0,42% done
config:

        NAME                          STATE     READ WRITE CKSUM
        pool                          DEGRADED     0     0     9
          raidz1-0                    DEGRADED     0     0    18
            data1                     ONLINE       0     0     0
            replacing-1               DEGRADED     0     0     0
              15119075650261564517    UNAVAIL      0     0     0  was /dev/mapper/transport
              data2                   ONLINE       0     0     0  (resilvering)
            data3                     ONLINE       0     0     0  (resilvering)
        logs
          data-slog                   ONLINE       0     0     0
        cache
          data-cache                  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <list of 3 files>
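For reference, the export/import cycle mentioned above was essentially the following (with the old drive physically removed in between); a sketch, assuming the pool name from the output above, and -d /dev/mapper may be needed so the device-mapper nodes are scanned on import:

# zpool export pool
  ... physically remove the old transport drive ...
# zpool import -d /dev/mapper pool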
Still, although the old drive is clearly no longer necessary for the pool to function, I cannot offline (let alone detach) it:
# zpool offline pool 15119075650261564517
cannot offline 15119075650261564517: no valid replicas
What is going on?
Update: Apparently, ZoL hadn't given up on the failing device just yet. After I replaced the 3 files with permanent errors (one of which was a zvol, so I had to create a new zvol, dd conv=noerror the contents over, and destroy the old one) and let the resilver finish, the old drive was finally removed.
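For reference, the zvol replacement was along these lines; a sketch with placeholder dataset names and size, not my actual ones:

# zfs create -V 100G pool/newvol
# dd if=/dev/zvol/pool/oldvol of=/dev/zvol/pool/newvol bs=1M conv=noerror
# zfs destroy pool/oldvol

(In hindsight, conv=noerror,sync would have been safer: it zero-pads blocks that fail to read, so the data after a bad block stays at the correct offsets in the copy.)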
I'd still be interested in what ZoL was thinking. After all, everything that didn't cause read or checksum errors had already been copied over to the new device, and the blocks that did cause errors had already been marked as permanent errors. So why hang on to an old device that ZoL clearly didn't intend to read any more data from?
Similar situation here, and a partial resolution, for reference only: