We have a Fibre Channel SAN managed by two OpenSolaris 2009.06 NFS servers.
- Server 1 is managing 3 small volumes (300GB 15K RPM drives). It's working like a charm.
- Server 2 is managing one large RAID6 volume of 32 drives (2TB 7200 RPM drives). Total size is 50TB.
- Both servers run zpool version 14 and zfs version 3.
The slow 50TB server was installed a few months ago and was working fine. Users had filled up 2TB. I ran a small experiment: I created 1000 filesystems and took 24 snapshots on each. Everything went well as far as creating the filesystems, accessing them with their snapshots, and NFS-mounting a few of them.
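For reference, the experiment was roughly this (a sketch with made-up filesystem and snapshot names):
i=1
while [ $i -le 1000 ]; do
    pfexec zfs create bigdata/fs$i          # 1000 filesystems under the bigdata pool
    i=`expr $i + 1`
done
j=1
while [ $j -le 24 ]; do
    pfexec zfs snapshot -r bigdata@snap$j   # one recursive pass per snapshot = 24 snapshots on each fs
    j=`expr $j + 1`
done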
When I tried destroying the 1000 filesystems, the first one took several minutes and then failed, reporting that the filesystem was in use. I issued a system shutdown, but it took more than 10 minutes. I did not wait longer and shut the power off.
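The destroy attempts were along these lines (same made-up names; -r is needed because each filesystem has snapshots):
pfexec zfs destroy -r bigdata/fs1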
Now when booting, OpenSolaris hangs. The lights on the 32 drives are blinking rapidly. I left it for 24 hours - still blinking but no progress.
I booted into a system snapshot taken before the zpool was created and tried importing the zpool:
pfexec zpool import bigdata
Same situation: LEDs blinking and the import hangs forever.
DTracing the "zpool import" process shows only the ioctl system call:
dtrace -n 'syscall:::entry /pid == 31337/ { @syscalls[probefunc] = count(); }'
ioctl 2499
Is there a way to fix this? Edit: Yes. Upgrading OpenSolaris to snv_134b did the trick:
pkg publisher # shows opensolaris.org
beadm create opensolaris-updated-on-2010-12-17
beadm mount opensolaris-updated-on-2010-12-17 /mnt
pkg -R /mnt image-update
beadm unmount opensolaris-updated-on-2010-12-17
beadm activate opensolaris-updated-on-2010-12-17
init 6
Now I have zfs version 3. The bigdata zpool stays at version 14. And it's back in production!
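To check the versions after the upgrade, something like this works:
pfexec zpool get version bigdata     # pool version, still 14
pfexec zfs get version bigdata       # filesystem version, still 3
pfexec zpool upgrade -v              # versions supported by the new build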
But what was it doing with all that heavy I/O for more than 24 hours (before the software upgrade)?
With ZFS you really want to let it manage the disks directly, since that improves concurrency. The single 50TB volume you handed it is a choke point.
That DTrace invocation only tracks syscalls. The real action happens inside the kernel; if you want to see what is consuming most of the CPU time, use the hotkernel script from the DTrace Toolkit.
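If you have the DTrace Toolkit unpacked somewhere (the /opt/DTT path below is just an assumption), it's as simple as:
pfexec /opt/DTT/hotkernel        # sample kernel functions on-CPU, Ctrl-C to print the summary
pfexec /opt/DTT/hotkernel -m     # same, aggregated by kernel module (e.g. zfs)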
When you import a pool, ZFS reads the config from the disks and validates it. After the pool is imported, it starts mounting all those 1000 filesystems (with their snapshots) that you created, which can take a while. If you had dedup enabled (which you don't, since you were on snv_111), it would take even longer, since it has to load the dedup table (DDT).
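Once the pool does eventually come up, you can get a feel for how much mount-time work is involved (assuming the pool is called bigdata):
pfexec zfs list -H -t filesystem -r bigdata | wc -l   # filesystems to mount
pfexec zfs list -H -t snapshot -r bigdata | wc -l     # snapshots in the pool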
Cutting the power like that is never a good option, especially on OpenSolaris snv_111. You haven't posted your pool configuration (zpool status), but if you have slog devices and they fail, you won't be able to import the pool (this has been addressed recently in Solaris 11 Express snv_151a).
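You can check for separate log devices with zpool status; slog vdevs show up under their own "logs" section:
pfexec zpool status -v bigdata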
My advice is to export each of the 32 disks from the SAN individually and create multiple raidz2 vdevs, so you have more read/write heads working for you. Do not create huge vdevs with more than 8 disks, because performance will be abysmal.
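Purely as an illustration (the device names and the pool name are made up), four 8-disk raidz2 vdevs instead of one big LUN would look like this:
pfexec zpool create bigdata2 \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
    raidz2 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 c1t15d0 \
    raidz2 c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0 \
    raidz2 c1t24d0 c1t25d0 c1t26d0 c1t27d0 c1t28d0 c1t29d0 c1t30d0 c1t31d0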
If you cannot afford to have the system down for that long (most people can't), study ZFS snapshots carefully and learn how to replicate them to a remote server with zfs send/receive. That will allow you to bring a failover server up quickly.
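A minimal send/receive cycle looks something like this (failover-host and the target pool tank are assumptions):
pfexec zfs snapshot -r bigdata@repl-1
pfexec zfs send -R bigdata@repl-1 | ssh root@failover-host zfs receive -Fd tank               # initial full copy
pfexec zfs snapshot -r bigdata@repl-2
pfexec zfs send -R -i repl-1 bigdata@repl-2 | ssh root@failover-host zfs receive -Fd tank     # incremental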
'zpool import' is more or less just reading back the configuration of your vdevs (via the zpool.cache) directly. I'm guessing what was taking forever to finish here was your delete transaction.
Given that ZFS is transactional, and that you removed 1000 filesystems, each with 24 snapshots, you had a very intensive delete that needed to check reference pointers across 24,000 snapshots. Add the seek time of those SATA heads and all the metadata tree updates that had to be written out, and it can easily take that long.
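While a delete like that is grinding away, you can at least confirm the disks are doing real work rather than the box being wedged:
iostat -xn 5                      # per-device throughput and service times every 5 seconds
pfexec zpool iostat bigdata 5     # pool-level I/O, once the pool is visible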