We've got a CentOS server running a cluster of virtuals. Occasionally the cluster's internal network drops out for a minute or so ... and then comes back. The problem is somehow related to the actual network traffic, but it is not a simple load issue. (The system is generally lightly loaded, and the problem occurs irrespective of actual load.)
The setup:
- CentOS 5.6 on Dom0, various CentOS versions on the DomUs
- Hardware - a Dell R710 with a Broadcom NetXtreme II NIC (sigh)
- Using the latest Broadcom drivers for the NIC
- Xen configured to use network-bridge and vif-bridge
- Some iptables tweaks to route an unrelated port to one of the virtuals (a sketch follows this list).
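For reference, the port forwarding is a standard DNAT rule. A minimal sketch, with a made-up internal address (10.0.0.2) and port (2222) rather than our real ones:

```sh
# Forward an external TCP port to a NAT'ed DomU.
# 10.0.0.2 and 2222 are illustrative values, not our actual config.
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2222 \
    -j DNAT --to-destination 10.0.0.2:22
# Let the forwarded traffic through the FORWARD chain:
iptables -A FORWARD -p tcp -d 10.0.0.2 --dport 22 -j ACCEPT
```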
The system has one externally visible IP address, and Dom0 runs an Apache httpd configured with a number of virtual hosts each of which reverse proxies to web servers running on the virtuals. (The virtuals have to be NAT'ed, primarily because we don't have enough allocated public IP addresses.)
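Each virtual host is a plain reverse proxy. A minimal sketch of one vhost, with hypothetical names and an assumed internal address:

```apache
<VirtualHost *:80>
    ServerName app1.example.com
    # 10.0.0.2 is an assumed internal address for one of the DomUs.
    ProxyPass        / http://10.0.0.2:80/
    ProxyPassReverse / http://10.0.0.2:80/
</VirtualHost>
```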
The symptoms:
- Works fine most of the time.
- When someone tries to UPLOAD a large file to one of the virtuals, the internal network drops out ... for all virtuals:
- The Dom0 httpd sees a network timeout talking to the backend server on the virtual and reports a 502.
- A previously established ssh connection from Dom0 to any of the DomUs freezes.
- Our monitoring shows ping failures for traffic between virtuals.
- The Xen consoles to the DomUs do not freeze.
- No log messages in any log files that I can see, on either Dom0 or the DomUs ... apart from the Dom0 httpd logs.
- After a minute or so, the problem clears by itself.
This is 100% reproducible.
What we've tried:
- Downloading, building and installing the latest BNX2 driver on Dom0
- Turning off MSI on the NIC - adding "options bnx2 disable_msi=1" to /etc/modprobe.conf
- Turning off TCP segmentation offload - "ethtool -K eth0 tso off" (a sketch of these steps follows this list).
- Sacrificing a black rooster at midnight.
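For anyone wanting to replicate the non-poultry steps, this is roughly what they look like, assuming the NIC is eth0:

```sh
# Disable MSI for the bnx2 driver (applies on the next module reload or reboot):
echo "options bnx2 disable_msi=1" >> /etc/modprobe.conf

# Turn off TCP segmentation offload immediately ...
ethtool -K eth0 tso off
# ... and verify the offload settings:
ethtool -k eth0

# The ethtool setting does not survive a reboot; one simple way to
# persist it is to append the ethtool command to /etc/rc.local.
```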
I've exhausted all my options apart from switching to KVM ... or slaughtering more roosters.
Any suggestions?
We eventually did find the problem. It turned out to be caused by a problem in our virtual network configuration. For some reason that I can no longer remember exactly, the network traffic for that particular transfer was taking an extra loop through the virtual networks. When a user tried to upload a large file, the transfer tied up all of the available kernel network buffers. That caused the entire network to freeze ... until something timed out and it all unjammed.
I'm sorry that this is all a bit vague, but it may offer some hints for people who run into a similar problem.
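If you suspect something similar, inspecting the bridge topology on Dom0 is a reasonable first step. A sketch of the kind of checks that would have pointed us at the loop (xenbr0 is the Xen default bridge name; yours may differ):

```sh
# Show which vifs are attached to which bridges; a vif on an unexpected
# bridge (or one bridge patched into another) hints at a loop.
brctl show

# Watch the bridge while reproducing the upload; traffic that crosses
# the bridge more than once shows up as duplicated frames.
tcpdump -n -e -i xenbr0

# Check whether the kernel is dropping packets for lack of buffers:
netstat -s | grep -i -E 'drop|overflow'
```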
Perhaps there is a limited pool of network threads connecting the virtuals to the host, and uploading large files eventually takes up all of them, so the rest lose connectivity. I've got no other guesses. Sorry.
You might take a look at memory overcommitment and/or swap configuration(s). If either is "tuned to the hilt", then a large file upload may trigger the management of those resources, making the network unavailable until that management completes.
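A quick way to eyeball those knobs (these are the standard Linux sysctls; whether they are actually implicated here is just a guess):

```sh
# Current overcommit policy and swap behaviour:
sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness
# Memory and swap actually in use while reproducing the upload:
free -m
vmstat 5
```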
Are you sure you do not have MAC address conflicts?
This is just a wild guess, but it happens easily if one copies Xen domU config files and forgets to change the MAC address so that it is unique for each domU and interface. I have seen this cause strange network problems where all connectivity was lost for exactly 60 seconds occasionally.
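A quick way to check for duplicates across the domU configs (assuming they live in /etc/xen, the default location):

```sh
# List every MAC address in the Xen config files and print any that
# appear more than once:
grep -h -o -E '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' /etc/xen/* 2>/dev/null \
    | tr 'A-F' 'a-f' | sort | uniq -d
```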