I'm having a problem with a few Linux boxes running Xen. They are acting as hypervisors and are connected to a SAN using a multipath setup to provide storage to the guest VMs.
Every now and then one of the two paths fails, but it can be quickly restored by running:
multipath
multipath -ll
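As a side note, to timestamp when a path actually drops (so it can be lined up with the kernel messages below), a simple watch loop like this could be left running; the log file name and the interval are arbitrary choices, not part of my setup:
while true; do
    echo "=== $(date) ===" >> /var/log/multipath-watch.log
    multipath -ll >> /var/log/multipath-watch.log 2>&1
    sleep 30
done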
I need to get to the bottom of this and find out why it is happening. I have noticed that it doesn't occur when the hypervisor is not too busy (network- and I/O-wise). I have also eliminated a possible hardware problem by moving all the services onto an identical new chassis. I have collected a few system logs which may point to a NIC module issue or a kernel problem; the failing multipath might only be a consequence of that. Here is the bit of log which always shows up when multipath goes down:
kernel: BUG: soft lockup - CPU#0 stuck for 60s! [swapper:0]
kernel: BUG: soft lockup - CPU#2 stuck for 60s! [events/2:76]
I'll paste the full logs at the end of this post to keep it easy to read. Now a little bit more about my setup:
- Internet access is set up over eth0 and eth2 (bonded)
- SAN multipath access is set up over eth1 and eth3
Server:
- Supermicro SuperServer 6016T-NTRF
- Intel(R) Xeon(R) CPU E5645
- Intel Corporation 82576 Gigabit Network
CentOS release 5.7 (Final) 2.6.18-274.18.1.el5xen
filename: /lib/modules/2.6.18-274.18.1.el5xen/kernel/drivers/net/igb/igb.ko
version: 3.0.6-k2-1
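Since the igb module is one of the suspects, I'm also thinking of checking the driver's error counters on the SAN-facing ports during a busy period. Something like this (eth1 shown here, and the grep pattern is just a convenience filter):
ethtool -S eth1 | grep -Ei 'err|drop|miss'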
- Log 02
If anyone needs more details, please get in touch. Any help will be much appreciated.
Since this appears to be an iSCSI setup, there are a couple of areas where path failovers can occur.
Multipath setups are very sensitive to latency on the wire, and iSCSI + Ethernet is going to have more of that than a Fibre Channel environment. Some flapping is going to be normal.
As this seems to happen when the hypervisor is busy, it suggests that the kernel NIC paths are getting either congested with data or starved for CPU (possibly both), which is triggering the multipath failover. There isn't a lot you can do about that directly, but you can narrow things down so you can better explain why it is doing what it is doing.
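One easy way to narrow it down is to check the ordering of events in syslog: if the soft lockups consistently appear before multipathd reports the path checker failing, multipath is just the messenger. Something along these lines (the exact message strings depend on your path checker and device names):
grep -E 'soft lockup|multipathd|checker' /var/log/messages | less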
Checking server load is pretty easy, and it sounds like you've already done that.
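For completeness, a couple of commands that make it easy to watch dom0 CPU pressure while the problem is happening (assuming the standard sysstat and Xen tools are installed; the intervals are arbitrary):
xentop -d 5        # per-domain CPU usage from the hypervisor's point of view
vmstat 5           # CPU, I/O wait and context switches inside dom0
sar -u 5           # overall CPU utilisation, including %iowait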
Diagnosing congestion is harder. If your network port bandwidth monitors aren't showing a lot of traffic but the log entries you posted happen anyway, that is a sign that the server is clogging up internally. If you can grab a packet capture during one of these events, the time-stamped packets will tell you whether there really are multi-second gaps in passed traffic that line up with the soft lockups; a sure sign that the server is internally clogged.
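If you go the capture route, something like this on one of the SAN-facing interfaces would do; eth1 and the standard iSCSI port 3260 are assumptions about your setup, and -s 96 keeps only headers so the file stays manageable:
tcpdump -i eth1 -s 96 -w /var/tmp/eth1-iscsi.pcap port 3260
Then read it back with full timestamps and look for long gaps between consecutive packets:
tcpdump -tttt -r /var/tmp/eth1-iscsi.pcap | less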
Fixing the problem is likely to be driver-specific, possibly combined with some tuning of the TCP/IP stack tunables.
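As a sketch of the kind of knobs involved (the values are illustrative only, and whether the igb driver accepts them is something to verify on your hardware):
ethtool -g eth1                               # show current RX/TX ring sizes
ethtool -G eth1 rx 4096                       # enlarge the receive ring if the driver supports it
sysctl -w net.core.netdev_max_backlog=30000   # queue more packets per softirq pass
sysctl -w net.core.rmem_max=16777216          # allow larger socket receive buffers
Change one thing at a time so it's clear which adjustment actually made a difference.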