I have a Sun x4540 storage server running NexentaStor Enterprise. It's serving NFS over 10GbE (CX4) to several VMware vSphere hosts, with 30 virtual machines running.
For the past few weeks, I've had random crashes spaced 10-14 days apart. This system used to run OpenSolaris and was stable in that configuration. The crashes trigger the automated system recovery feature on the hardware, forcing a hard system reset.
Here's the output from the mdb debugger:
panic[cpu5]/thread=ffffff003fefbc60:
Deadlock: cycle in blocking chain
ffffff003fefb570 genunix:turnstile_block+795 ()
ffffff003fefb5d0 unix:mutex_vector_enter+261 ()
ffffff003fefb630 zfs:dbuf_find+5d ()
ffffff003fefb6c0 zfs:dbuf_hold_impl+59 ()
ffffff003fefb700 zfs:dbuf_hold+2e ()
ffffff003fefb780 zfs:dmu_buf_hold+8e ()
ffffff003fefb820 zfs:zap_lockdir+6d ()
ffffff003fefb8b0 zfs:zap_update+5b ()
ffffff003fefb930 zfs:zap_increment+9b ()
ffffff003fefb9b0 zfs:zap_increment_int+68 ()
ffffff003fefba10 zfs:do_userquota_update+8a ()
ffffff003fefba70 zfs:dmu_objset_do_userquota_updates+de ()
ffffff003fefbaf0 zfs:dsl_pool_sync+112 ()
ffffff003fefbba0 zfs:spa_sync+37b ()
ffffff003fefbc40 zfs:txg_sync_thread+247 ()
ffffff003fefbc50 unix:thread_start+8 ()
Any ideas what this means?
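(In case it's useful: a trace like the one above can be pulled from the saved kernel crash dump with mdb, roughly as below; the dump paths are illustrative, not my exact ones.)

# open the saved kernel crash dump (dump 0 in this host's crash directory)
mdb /var/crash/myhost/unix.0 /var/crash/myhost/vmcore.0
::status   # panic summary, including the panic string
::stack    # stack trace of the panicking thread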
Additional information: I don't believe I have any quotas enabled, either at the filesystem level or per-user.
========== Volumes and Folders ===========
NAME USED AVAIL REFER MOUNTED QUOTA DEDUP COMPRESS
syspool/rootfs-nmu-000 9.84G 195G 3.84G yes none off off
syspool/rootfs-nmu-001 79.5K 195G 1.16G no none off off
syspool/rootfs-nmu-002 89.5K 195G 2.05G no none off off
syspool/rootfs-nmu-003 82.5K 195G 6.30G no none off off
vol1/AueXXXch 33.9G 1.28T 23.3G yes none on on
vol1/CXXXG 8.72G 1.28T 6.22G yes none on on
vol1/CoaXXXuce 97.8G 1.28T 61.4G yes none on on
vol1/HXXXco 58.1G 1.28T 41.1G yes none off on
vol1/HXXXen 203G 1.28T 90.0G yes none off on
vol1/HXXXny 9.65G 1.28T 8.48G yes none off on
vol1/InXXXuit 2.03G 1.28T 2.03G yes none off on
vol1/MiXXXary 196G 1.28T 105G yes none off on
vol1/RoXXXer 45.5G 1.28T 28.7G yes none off on
vol1/TudXXXanch 6.06G 1.28T 4.54G yes none off on
vol1/aXXXa 774M 1.28T 774M yes none off off
vol1/ewXXXte 46.4G 1.28T 46.4G yes none on on
vol1/foXXXce 774M 1.28T 774M yes none off off
vol1/saXXXe 69K 1.28T 31K yes none off on
vol1/vXXXre 72.4G 1.28T 72.4G yes none off on
vol1/xXXXp 29.0G 1.28T 18.6G yes none off on
vol1/xXXXt 100G 1.28T 52.4G yes none off on
vol2/AuXXXch 22.9G 2.31T 22.9G yes none on on
vol2/FamXXXree 310G 2.31T 230G yes none off on
vol2/LAXXXty 605G 2.31T 298G yes none off on
vol2/McXXXney 147G 2.31T 40.3G yes none off on
vol2/MoXXXri 96.8G 2.31T 32.6G yes none off on
vol2/TXXXta 676G 2.31T 279G yes none off on
vol2/VXXXey 210G 2.31T 139G yes none off on
vol2/vmXXXe2 2.69G 2.31T 2.69G yes none off on
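In case it helps, the quota state can be listed like this (the dataset name is just one example from the listing above; the same check applies to the others):

# filesystem-level quotas on everything in both data pools
zfs get -r quota vol1 vol2

# per-user space accounting and quotas on a single dataset
zfs userspace vol1/HXXXco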
I know nothing about this setup, but the frame
ffffff003fefb820 zfs:zap_lockdir+6d ()
seems to indicate that the worker thread has already locked the ZAP directory, and mutex_vector_enter further up the trace then blocks trying to take a lock that's already held, completing the cycle in the blocking chain that the panic reports.
It all seems to stem from a quota update. If it's possible, you might want to consider turning quotas off if they are unnecessary.
It's only a workaround rather than a fix, and I have no idea whether it'll work as expected, but it might be worth a try.
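If any per-user quotas do show up, removing one should just be a property change; a sketch, with placeholder user and dataset names:

# clear the per-user quota for one user on one dataset
zfs set userquota@someuser=none vol1/HXXXco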
The stack trace references "userquota", which is not typically used by our customers. Note that user quotas are separate from the filesystem quotas you can also set. I encourage you to turn off user quotas if you can, especially since you believe they are unnecessary. I also encourage you to file a support ticket if you have a support contract; it can be submitted from the Web GUI, which will include diagnostics from your system in the ticket.
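To illustrate the distinction (the user and dataset names here are examples only):

# filesystem quota: caps the space the dataset itself may use
zfs get quota vol1/HXXXco

# user quota: caps the space charged to one user within the dataset
zfs get userquota@jsmith vol1/HXXXco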
This was resolved permanently by recreating all of the zpools under Nexenta. The pools carried a lot of baggage from the original OpenSolaris installation, and while I imported and upgraded the pools and filesystems, the stability wasn't there until everything was rebuilt.
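For anyone in the same position, the rebuild followed the usual snapshot/send/destroy/recreate cycle; roughly the steps below, though the snapshot name, holding pool, and disk layout are illustrative rather than my exact commands:

# replicate everything to a holding pool
zfs snapshot -r vol1@migrate
zfs send -R vol1@migrate | zfs receive -F backup/vol1

# destroy and recreate the pool cleanly under Nexenta
zpool destroy vol1
zpool create vol1 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

# replicate the data back into the fresh pool
zfs send -R backup/vol1@migrate | zfs receive -F vol1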