One disk in my four-disk RAIDZ1 has become physically noisy: it isn't producing errors yet, but it doesn't sound healthy either. So I've chosen to replace it pre-emptively.
I have done:

```shell
zpool offline tank0 ad6
```

then shut down, removed and replaced the disk, and ran:

```shell
zpool replace tank0 ad6 ad6
```

which hangs forever. `zpool status` also hangs forever, as does `zpool history`.
If I reboot the machine with the disk removed, everything works fine in degraded mode, as expected.
What do I do now? I'm worried, because my data is now vulnerable to a single disk failure.
OS is FreeBSD 7.3-RELEASE-p1 - a.k.a. FreeNAS 7.0.2.5226
I have just tried the same operation in a VM, albeit on FreeBSD 7.3-RELEASE-p7 (FreeNAS 0.7.2.8191, a slightly later version) - it works perfectly. I'm now trying the oldest version of FreeNAS I can find (7.0.2.5799) and will update later.
Also, does `zpool replace` require that nothing is using the filesystem? There's a possibility that another daemon on the NAS is using it. I assume this would be OK, but of course that may be wrong.
Update, 2012-01-10
I booted the machine with FreeNAS 8 and ran the `zpool replace` - it started, then immediately began throwing piles of data corruption errors and kernel panics, despite a weekly scrub of the pool never having found any issues. I don't think I did anything stupid like telling it to replace the wrong disk. I immediately issued `shutdown -h`, since I knew the data was just fine.
Anyhow, I now have a degraded pool, stuck in a state where the replace is suspended, and I am copying my data off to a 3TB external drive, bought at great expense, so I can destroy the pool and start again. Thankfully, the data looks okay - I happen to have md5sums of about 100GB of the files, which so far seem to be intact, and I have managed to recover everything that is truly irreplaceable.
I am now waiting for more RAM to arrive, since FreeNAS 8 keeps panicking with "kmem_map too small" errors, which I don't seem to be able to tune around, and the machine was RAM-constrained (1 GB of RAM for a 4 TB RAIDZ1).
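For reference, the tunables people usually adjust for this panic on RAM-constrained FreeBSD/FreeNAS boxes live in `/boot/loader.conf`. The values below are purely illustrative for a 1 GB machine, not tested recommendations; appropriate limits depend on the kernel version and workload:

```shell
# /boot/loader.conf - illustrative values only, for a RAM-constrained box.
# vm.kmem_size caps the kernel memory map that ZFS allocates from;
# vfs.zfs.arc_max limits the ZFS ARC so it fits comfortably inside it.
vm.kmem_size="512M"
vm.kmem_size_max="512M"
vfs.zfs.arc_max="160M"
```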
Hard lesson learned about backups, but also confidence in ZFS/FreeNAS/FreeBSD really knocked.
Update, 2012-01-13
Well my data appears to be safely backed up now.
`zpool status -v` hangs even with failmode set to continue. Here's the output of `zpool status` with the new disk plugged in (ada1):
```
  pool: tank0
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank0                      DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ada2                   ONLINE       0     0     0
            ada3                   ONLINE       0     0     0
            ada0                   ONLINE       0     0     0
            replacing              DEGRADED     0     0 3.17K
              6540253751212815210  UNAVAIL      0     0     0  was /dev/ada3/old
              ada1                 ONLINE       0     0     0

errors: 3130 data errors, use '-v' for a list
```
With the old disk plugged in instead of the new one, ZFS won't import the pool, and `zpool status` says:
```
        tank0          UNAVAIL  insufficient replicas
          raidz1       FAULTED  corrupted data
            ada2       ONLINE
            ada3       ONLINE
            ada0       FAULTED  corrupted data
            replacing  UNAVAIL  insufficient replicas
              ada1     FAULTED  corrupted data
              ada1     FAULTED  corrupted data
```
I don't see why ada0 should be FAULTED with the old disk plugged in but ONLINE with the new disk (ada1) plugged in. I don't see how ada0 is even related.
Let's try to recover this pool as a learning exercise.
I'm not a ZFS guru, but I'll take a shot: it sounds like the ZFS subsystem is still trying to access the failed drive and hanging for some reason. Try setting the pool's `failmode` property to `continue` (`zpool set failmode=continue tank0`) and see if that makes the hang go away and lets you suss out what's going on.

(Note that this isn't a fix: the system still can't access a drive it thinks it should be able to access; it's just being told to return an error and keep going rather than blocking until an answer is received.)
I was truly backed into a corner with this. I ended up flattening the pool and restoring files from backup onto FreeNAS 8.
It feels far more stable so far - the newer x64 OS and 4 GB of RAM are probably both contributing.
I recently had a situation that sounds similar, though I wasn't experiencing hangs; I was just unable to replace the failed drive. Of course, I was in a totally different environment: Linux with ZFS-fuse. However, unlike you, I was not being told I had experienced data corruption; I was seeing:
Now, before going any further, it's important to realize that none of the data on this pool was irreplaceable; everything was either backed up or was itself a backup of another system. If you do not have good backups of your data, you probably want to stop at this point and make raw copies of the discs before doing anything else, in case this makes things worse.
What I ended up doing that worked was this.
First I exported the pool with `zpool export POOLNAME`. I then rebooted and did a `zpool import POOLNAME`.
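Sketched out, that cycle looks like this (note that export and import are `zpool` subcommands, and `POOLNAME` stands in for your pool's actual name):

```shell
# Cleanly detach the pool from the system...
zpool export POOLNAME
# ...reboot, then re-import it, which re-scans the member devices.
zpool import POOLNAME
zpool status POOLNAME
```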
Now when I did a "zpool status", I got this:
Now I was able to use the above number to replace the disc using:
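The exact command I ran isn't reproduced above, but as a hedged sketch of its shape, it replaces the vanished disk by the numeric GUID that `zpool status` reports for it (the GUID and device name below are placeholders, not real values):

```shell
# Replace the old drive, now identified only by its GUID, with the new device.
zpool replace POOLNAME 1234567890123456789 /dev/sdX
```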
Now it showed up as replacing the drive in "zpool status":
It "only" took around 48 hours to run, not the estimated 282 above. :-)