Ping a Specific Port

Question

Pavel

Asked: 2012-05-13 08:03:14 +0800 CST2012-05-13 08:03:14 +0800 CST 2012-05-13 08:03:14 +0800 CST

Solaris 11 hangs randomly: need help to figure out the reason

772

I've got Solaris 11 (+latest SRU) running on an HP DL385 G7 (attached to P2000 storage w. 30 disks; they are registered as separate RAID0 drives, but I'm using ZFS' raidz1), which is our file server. Every couple of days, the system freezes and needs to be restarted. There is nothing special in the logs and fmdump.

I ended up with a cron job dumping various statistics every 2 minutes to the hard drive which show that there is a load increase and memory decrease just before the crash:

$ grep load top.120512*

top.120512063601:last pid: 21751;  load avg:  0.61,  2.30,  2.93;  up 4+17:03:45        06:36:02
top.120512063800:last pid: 21765;  load avg:  0.27,  1.62,  2.59;  up 4+17:05:44        06:38:01
top.120512064000:last pid: 21779;  load avg:  0.29,  1.17,  2.30;  up 4+17:07:45        06:40:02
top.120512064200:last pid: 21793;  load avg:  0.56,  0.97,  2.09;  up 4+17:09:44        06:42:01
top.120512064400:last pid: 21807;  load avg:  0.20,  0.71,  1.85;  up 4+17:11:45        06:44:02
top.120512064600:last pid: 21821;  load avg:  0.60,  0.66,  1.68;  up 4+17:13:45        06:46:02
top.120512064800:last pid: 21835;  load avg:  1.25,  0.87,  1.64;  up 4+17:15:44        06:48:01
top.120512065000:last pid: 21851;  load avg:  4.77,  2.35,  2.10;  up 4+17:17:45        06:50:02
top.120512065200:last pid: 21864;  load avg:  5.10,  3.20,  2.45;  up 4+17:19:45        06:52:02
top.120512065400:last pid: 21878;  load avg:  5.81,  4.16,  2.91;  up 4+17:21:44        06:54:01
top.120512065601:last pid: 21892;  load avg:  5.26,  4.53,  3.20;  up 4+17:23:45        06:56:02
top.120512065800:last pid: 21906;  load avg:  5.36,  4.79,  3.46;  up 4+17:25:45        06:58:02

// here was the crash

top.120512163801:last pid:   701;  load avg:  1.18,  0.29,  0.10;  up 0+00:01:16        16:38:02
top.120512164000:last pid:  1456;  load avg:  0.36,  0.33,  0.14;  up 0+00:03:16        16:40:02
top.120512164200:last pid:  1470;  load avg:  0.14,  0.26,  0.14;  up 0+00:05:16        16:42:02
top.120512164400:last pid:  1499;  load avg:  0.39,  0.35,  0.19;  up 0+00:07:15        16:44:01
top.120512164600:last pid:  1513;  load avg:  0.10,  0.26,  0.17;  up 0+00:09:16        16:46:02

Or grep Memory:

top.120512064600:Memory: 16G phys mem, 2031M free mem, 2048M total swap, 2048M free swap
top.120512064800:Memory: 16G phys mem, 2047M free mem, 2048M total swap, 2048M free swap
top.120512065000:Memory: 16G phys mem, 1443M free mem, 2048M total swap, 2048M free swap
top.120512065200:Memory: 16G phys mem, 1313M free mem, 2048M total swap, 2048M free swap
top.120512065400:Memory: 16G phys mem, 892M free mem, 2048M total swap, 2048M free swap
top.120512065601:Memory: 16G phys mem, 418M free mem, 2048M total swap, 2048M free swap
top.120512065800:Memory: 16G phys mem, 294M free mem, 2048M total swap, 2044M free swap

// restart

top.120512163801:Memory: 16G phys mem, 14G free mem, 2048M total swap, 2048M free swap

or grep trap:

top.120512064800:Kernel: 50542 ctxsw, 13 trap, 113144 intr, 850 syscall, 9 flt    
top.120512065000:Kernel: 76357 ctxsw, 9 trap, 199203 intr, 399 syscall, 9 flt   
top.120512065200:Kernel: 72294 ctxsw, 13 trap, 254779 intr, 481 syscall, 9 flt    
top.120512065400:Kernel: 87671 ctxsw, 11 trap, 256663 intr, 401 syscall, 11 flt    
top.120512065601:Kernel: 72696 ctxsw, 11 trap, 281765 intr, 402 syscall, 11 flt    
top.120512065800:Kernel: 77316 ctxsw, 458 trap, 272329 intr, 412 syscall, 450 flt    
// restarted here    
top.120512163801:Kernel: 1570 ctxsw, 10 trap, 2380 intr, 1741 syscall, 9 flt

this one is from echo "::memstat" | mdb -k:

top.120512064800:ZFS File Data             2898132             11320   69%    
top.120512065000:ZFS File Data             3039466             11872   73%    
top.120512065200:ZFS File Data             3081508             12037   74%    
top.120512065400:ZFS File Data             3188175             12453   76%    
top.120512065601:ZFS File Data             3309405             12927   79%    
top.120512065800:ZFS File Data             3393392             13255   81%    
// restart    
top.120512163801:ZFS File Data               70094               273    2%    
top.120512164000:ZFS File Data               93547               365    2%    
top.120512164200:ZFS File Data              197571               771    5%    
top.120512164400:ZFS File Data             1175965              4593   28%    
top.120512164600:ZFS File Data             1205865              4710   29%    
top.120512164800:ZFS File Data             2537072              9910   61%

The ZFS pool is not corrupted, the actual load is below the average (compared to our other file servers), the hardware seems to be ok, too.

What do you think what might be the reason for this behavior? What other statistics do I need to collect?

Edit:

$ zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Wed Apr 25 14:40:49 2012
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c3t0d0s0  ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: volume
 state: ONLINE
  scan: resilvered 285G in 2h57m with 0 errors on Mon May  7 22:01:38 2012
config:

        NAME                                       STATE     READ WRITE CKSUM
        volume                                     ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            c0t600C0FF00012FBB1F749674F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DDDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1DEA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D7CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1EAA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DEBA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1F0A1154F01000000d0  ONLINE       0     0     0
          raidz1-1                                 ONLINE       0     0     0
            c0t600C0FF00012FC7DFCA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1FDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D08A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB109A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D14A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB115A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D20A2154F01000000d0  ONLINE       0     0     0
          raidz1-2                                 ONLINE       0     0     0
            c0t600C0FF00012FBB171A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D2CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB12DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D38A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB139A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D44A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DA3CA754F01000000d0  ONLINE       0     0     0
          raidz1-3                                 ONLINE       0     0     0
            c0t600C0FF00012FC7D50A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB151A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D5CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB15DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D68A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB169A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D70A2154F01000000d0  ONLINE       0     0     0
        spares
          c0t600C0FF00012FBB1D7A1154F01000000d0    AVAIL   
          c0t600C0FF000131E9277AD154F01000000d0    AVAIL   

errors: No known data errors

$ zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     24.4G   249G  39.5K  /rpool
rpool/ROOT                14.1G   249G    31K  legacy
rpool/ROOT/solaris        5.59M   249G  11.5G  /
rpool/ROOT/solaris-1      14.1G   249G  11.5G  /
rpool/ROOT/solaris-1/var  2.15G   249G  1.94G  /var
rpool/ROOT/solaris/var    2.71M   249G  1.29G  /var
rpool/dump                8.24G   250G  7.98G  -
rpool/export                63K   249G    32K  /export
rpool/export/home           31K   249G    31K  /export/home
rpool/swap                2.06G   249G  2.00G  -
volume                    8.77T  33.6T  6.77T  /volume
volume/gluster            33.5G  1.97T  33.5G  /volume/gluster

Edit 2

Here is a diff of various statistics: http://diffchecker.com/k19ZP458 (left: "normal" system state, right: just a minute before the crash)

1 Answers

Voted

ewwhite · Answer 1 · 2012-05-13T12:15:55+08:00

You haven't provided all of the details that would be helpful in diagnosing this problem.

How long have these crashes been occurring? (e.g. Is this a new installation or not?)
Could you show the output of zpool history or explain if you're using compression or deduplication in your storage pool?
How are you connected to the P2000? (fibre, SAS, iSCSI)
What HBA cards are installed in the DL385 G7?
What is the disk layout of the P2000? (it's a SAN, so it's not ideal for a ZFS solution)
Do you any HP Management Agents for Solaris installed?
Is your system firmware up-to-date?
Is the ILO configured? Have you examined its log? How is the health of the RAM?
Do you have the Automatic Server Recovery watchdog configured in the BIOS? That would trigger a restart after the system crashes. It would also help determine if this is a hardware or software issue.

So I ask about when these issues started because if this is a new installation, you have some options. From the look of your disk layout, this is intended to be a hefty storage system based on ZFS. There are several red flags in the setup, though...

One, you're exposing multiple vdisks from your SAN to ZFS. Basically, you have 30 individual RAID0 arrays defined in the P2000 SAN and are presenting those to Solaris. If you lose a disk, you'll need a reboot to recognize the new device.

Two, the choice of OS may be a problem as HP does not certify or fully support Solaris 11 on ProLiant systems yet. If this is purely a storage unit and you're not running any Solaris-specific software, NexentaStor would be a safe solution that supports the server hardware. I build most of my ZFS storage solutions on HP hardware. Even OpenIndiana would be a bit easier to support.

But if you need to troubleshot the real crashes, we'll need to know what's happening on the system. There could potentially be logs leading up to the crash. You may also have core files that could be of use. The method of connection to the SAN matters a bit, too, since I've seen odd NIC issues with Solaris and HP/Broadcom devices. That said, I'm betting this is networking related...

Solaris 11 hangs randomly: need help to figure out the reason

Can you pass user/pass for HTTP Basic Authentication in URL parameters?

Ping a Specific Port

Check if port is open or closed on a Linux server?

How to automate SSH login with password?

How do I tell Git for Windows where to find my private RSA key?

What's the default superuser username/password for postgres after a new install?

What port does SFTP use?

Command line to list users in a Windows Active Directory group?

What is a Pem file and how does it differ from other OpenSSL Generated Key File Formats?

How to determine if a bash variable is empty?