One disk in my four-disk RAIDZ1 has become physically noisy: it isn't producing errors yet, but it doesn't sound healthy either. So I've chosen to replace it pre-emptively.
I have done:

```shell
zpool offline tank0 ad6
```

then shut down, removed and replaced the disk, and ran:

```shell
zpool replace tank0 ad6 ad6
```

which hangs forever. `zpool status` also hangs forever, as does `zpool history`.
If I reboot the machine with the disk removed, everything works fine in degraded mode, as expected.
What do I do now? I'm worried, because my data is now vulnerable to a single disk failure.
OS is FreeBSD 7.3-RELEASE-p1 - a.k.a. FreeNAS 7.0.2.5226
I have just tried the same operation in a VM, albeit on FreeBSD 7.3-RELEASE-p7 (FreeNAS 0.7.2.8191, a slightly later version) - it works perfectly. I'm now trying the oldest version of FreeNAS I can find (7.0.2.5799) and will update later.
Also, does `zpool replace` require that nothing is using the filesystem? There's a possibility that another daemon on the NAS is using it. I assume this would be OK, but of course that may be wrong.
Update, 2012-01-10
I booted the machine with FreeNAS 8 and ran the `zpool replace` - it started, then immediately began throwing piles of data corruption errors and kernel panics, despite a weekly scrub of the pool never having found any issues. I don't think I did anything stupid like telling it to replace the wrong disk. I immediately issued `shutdown -h`, since I knew the data was just fine.
Anyhow, I now have a degraded pool, stuck in a state where the replace is suspended, and I am copying my data off to a 3TB external drive, bought at great expense, so I can destroy the pool and start again. Thankfully, the data looks okay - I happen to have md5sums of about 100GB of the files, which so far seem to be intact, and I have managed to recover everything that is truly irreplaceable.
I am now waiting for more RAM to arrive, since FreeNAS 8 keeps panicking with "kmem_map too small" errors, which I don't seem to be able to tune around, and the machine was RAM-constrained (1 GB of RAM for a 4 TB RAIDZ1).
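For reference, the tunables people usually adjust for this panic on RAM-constrained FreeBSD/FreeNAS boxes live in `/boot/loader.conf`. The values below are purely illustrative for a 1 GB machine, not tested recommendations; appropriate limits depend on the kernel version and workload:

```shell
# /boot/loader.conf - illustrative values only, for a RAM-constrained box.
# vm.kmem_size caps the kernel memory map that ZFS allocates from;
# vfs.zfs.arc_max limits the ZFS ARC so it fits comfortably inside it.
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="160M"
```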
Hard lesson learned about backups, but also confidence in ZFS/FreeNAS/FreeBSD really knocked.
Update, 2012-01-13
Well my data appears to be safely backed up now.
`zpool status -v` hangs even with failmode set to continue. Here's the output of `zpool status` with the new disk plugged in (ada1):
```
  pool: tank0
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank0                      DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ada2                   ONLINE       0     0     0
            ada3                   ONLINE       0     0     0
            ada0                   ONLINE       0     0     0
            replacing              DEGRADED     0     0 3.17K
              6540253751212815210  UNAVAIL      0     0     0  was /dev/ada3/old
              ada1                 ONLINE       0     0     0

errors: 3130 data errors, use '-v' for a list
```
With the old disk plugged in instead of the new one, ZFS won't import the pool, and `zpool status` says:
```
        tank0          UNAVAIL  insufficient replicas
          raidz1       FAULTED  corrupted data
            ada2       ONLINE
            ada3       ONLINE
            ada0       FAULTED  corrupted data
            replacing  UNAVAIL  insufficient replicas
              ada1     FAULTED  corrupted data
              ada1     FAULTED  corrupted data
```
I don't see why ada0 should be FAULTED with the old disk plugged in but ONLINE with the new disk (ada1) plugged in. I don't see how ada0 is even related.
Let's try to recover this pool as a learning exercise.
I'm not a ZFS guru, but I'll take a shot: it sounds like the ZFS subsystem is still trying to access the failed drive and hanging for some reason. Try setting the pool's `failmode` property to `continue` (`zpool set failmode=continue tank0`) and see if that makes the hang go away and lets you suss out what's going on.

(Note that this isn't a fix: the system still can't access a drive it thinks it should be able to access; it's just being told to return an error and keep going rather than blocking until an answer is received.)
I was truly backed into a corner with this. I ended up flattening the pool and restoring files from backup onto FreeNAS 8.
It feels far more stable so far - the newer x64 OS and 4 GB of RAM are probably both contributing.
I recently had a situation that sounds similar, though I wasn't experiencing hangs; I was just unable to replace the failed drive. Of course, I was in a totally different environment: Linux with ZFS-fuse. However, unlike you, I was not being told I had experienced data corruption; I was seeing:
Now, before going any further, it's important to realize that none of the data on this pool was irreplaceable; everything was either backed up or was itself a backup of another system. If you do not have good backups of your data, you probably want to stop at this point and make raw copies of the discs before doing anything else, in case this makes things worse.
What I ended up doing that worked was this.
First I exported the pool with `zpool export POOLNAME`. I then rebooted and did a `zpool import POOLNAME`.
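Sketched out, that cycle looks like this (note that export and import are `zpool` subcommands, and `POOLNAME` stands in for your pool's actual name):

```shell
# Cleanly detach the pool from the system...
zpool export POOLNAME
# ...reboot, then re-import it, which re-scans the member devices.
zpool import POOLNAME
zpool status POOLNAME
```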
Now when I did a "zpool status", I got this:
Now I was able to use the above number to replace the disc using:
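The exact command I ran isn't reproduced above, but as a hedged sketch of its shape, it replaces the vanished disk by the numeric GUID that `zpool status` reports for it (the GUID and device name below are placeholders, not real values):

```shell
# Replace the old drive, now identified only by its GUID, with the new device.
zpool replace POOLNAME 1234567890123456789 /dev/sdX
```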
Now it showed up as replacing the drive in "zpool status":
It "only" took around 48 hours to run, not the estimated 282 above. :-)