I have a Sun x4540 storage server running NexentaStor Enterprise. It's serving NFS over 10GbE (CX4) to several VMware vSphere hosts, with 30 virtual machines running.
For the past few weeks, I've had random crashes spaced 10-14 days apart. This system used to run OpenSolaris and was stable in that configuration. The crashes trigger the automated system recovery feature on the hardware, forcing a hard system reset.
Here's the output from the mdb debugger:
panic[cpu5]/thread=ffffff003fefbc60:
Deadlock: cycle in blocking chain
ffffff003fefb570 genunix:turnstile_block+795 ()
ffffff003fefb5d0 unix:mutex_vector_enter+261 ()
ffffff003fefb630 zfs:dbuf_find+5d ()
ffffff003fefb6c0 zfs:dbuf_hold_impl+59 ()
ffffff003fefb700 zfs:dbuf_hold+2e ()
ffffff003fefb780 zfs:dmu_buf_hold+8e ()
ffffff003fefb820 zfs:zap_lockdir+6d ()
ffffff003fefb8b0 zfs:zap_update+5b ()
ffffff003fefb930 zfs:zap_increment+9b ()
ffffff003fefb9b0 zfs:zap_increment_int+68 ()
ffffff003fefba10 zfs:do_userquota_update+8a ()
ffffff003fefba70 zfs:dmu_objset_do_userquota_updates+de ()
ffffff003fefbaf0 zfs:dsl_pool_sync+112 ()
ffffff003fefbba0 zfs:spa_sync+37b ()
ffffff003fefbc40 zfs:txg_sync_thread+247 ()
ffffff003fefbc50 unix:thread_start+8 ()
Any ideas what this means?
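(In case it's useful: a trace like the one above can be pulled from the saved kernel crash dump with mdb, roughly as below; the dump paths are illustrative, not my exact ones.)

# open the saved kernel crash dump (dump 0 in this host's crash directory)
mdb /var/crash/myhost/unix.0 /var/crash/myhost/vmcore.0
::status   # panic summary, including the panic string
::stack    # stack trace of the panicking thread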
Additional information: I don't believe I have any quotas enabled, either at the filesystem level or per-user.
========== Volumes and Folders ===========
NAME USED AVAIL REFER MOUNTED QUOTA DEDUP COMPRESS
syspool/rootfs-nmu-000 9.84G 195G 3.84G yes none off off
syspool/rootfs-nmu-001 79.5K 195G 1.16G no none off off
syspool/rootfs-nmu-002 89.5K 195G 2.05G no none off off
syspool/rootfs-nmu-003 82.5K 195G 6.30G no none off off
vol1/AueXXXch 33.9G 1.28T 23.3G yes none on on
vol1/CXXXG 8.72G 1.28T 6.22G yes none on on
vol1/CoaXXXuce 97.8G 1.28T 61.4G yes none on on
vol1/HXXXco 58.1G 1.28T 41.1G yes none off on
vol1/HXXXen 203G 1.28T 90.0G yes none off on
vol1/HXXXny 9.65G 1.28T 8.48G yes none off on
vol1/InXXXuit 2.03G 1.28T 2.03G yes none off on
vol1/MiXXXary 196G 1.28T 105G yes none off on
vol1/RoXXXer 45.5G 1.28T 28.7G yes none off on
vol1/TudXXXanch 6.06G 1.28T 4.54G yes none off on
vol1/aXXXa 774M 1.28T 774M yes none off off
vol1/ewXXXte 46.4G 1.28T 46.4G yes none on on
vol1/foXXXce 774M 1.28T 774M yes none off off
vol1/saXXXe 69K 1.28T 31K yes none off on
vol1/vXXXre 72.4G 1.28T 72.4G yes none off on
vol1/xXXXp 29.0G 1.28T 18.6G yes none off on
vol1/xXXXt 100G 1.28T 52.4G yes none off on
vol2/AuXXXch 22.9G 2.31T 22.9G yes none on on
vol2/FamXXXree 310G 2.31T 230G yes none off on
vol2/LAXXXty 605G 2.31T 298G yes none off on
vol2/McXXXney 147G 2.31T 40.3G yes none off on
vol2/MoXXXri 96.8G 2.31T 32.6G yes none off on
vol2/TXXXta 676G 2.31T 279G yes none off on
vol2/VXXXey 210G 2.31T 139G yes none off on
vol2/vmXXXe2 2.69G 2.31T 2.69G yes none off on
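In case it helps, the quota state can be listed like this (the dataset name is just one example from the listing above; the same check applies to the others):

# filesystem-level quotas on everything in both data pools
zfs get -r quota vol1 vol2

# per-user space accounting and quotas on a single dataset
zfs userspace vol1/HXXXco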
I know nothing about this setup, but the frame
ffffff003fefb820 zfs:zap_lockdir+6d ()
seems to indicate that the worker thread has already locked the ZAP directory, and mutex_vector_enter further up the trace then blocks trying to take a lock that's already held, completing the cycle in the blocking chain that the panic reports.
It all seems to stem from a quota update. If it's possible, you might want to consider turning quotas off if they are unnecessary.
It's only a workaround rather than a fix, and I have no idea whether it'll work as expected, but it might be worth a try.
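If any per-user quotas do show up, removing one should just be a property change; a sketch, with placeholder user and dataset names:

# clear the per-user quota for one user on one dataset
zfs set userquota@someuser=none vol1/HXXXco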
The stack trace references "userquota", which is not typically used by our customers. Note that user quotas are separate from the filesystem quotas you can also set. I encourage you to turn off user quotas if you can, especially since you believe they are unnecessary. I also encourage you to file a support ticket if you have a support contract; it can be submitted from the Web GUI, which will include diagnostics from your system in the ticket.
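To illustrate the distinction (the user and dataset names here are examples only):

# filesystem quota: caps the space the dataset itself may use
zfs get quota vol1/HXXXco

# user quota: caps the space charged to one user within the dataset
zfs get userquota@jsmith vol1/HXXXco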
This was resolved permanently by recreating all of the zpools under Nexenta. The pools carried a lot of baggage from the original OpenSolaris installation, and while I imported and upgraded the pools and filesystems, the stability wasn't there until everything was rebuilt.
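For anyone in the same position, the rebuild followed the usual snapshot/send/destroy/recreate cycle; roughly the steps below, though the snapshot name, holding pool, and disk layout are illustrative rather than my exact commands:

# replicate everything to a holding pool
zfs snapshot -r vol1@migrate
zfs send -R vol1@migrate | zfs receive -F backup/vol1

# destroy and recreate the pool cleanly under Nexenta
zpool destroy vol1
zpool create vol1 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

# replicate the data back into the fresh pool
zfs send -R backup/vol1@migrate | zfs receive -F vol1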