We are seeing some strange packet loss and want to find the reason for it.
We have an imageserver and a second server for load-testing it. Both are located in the same datacenter.
First we run a load test like this (command shortened for readability):
ab -n 50 -c 5 http://testserver/img/de.png
The image is only about 300 bytes. The responses are very fast:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 0
Processing: 1 3 0.7 3 4
Waiting: 1 3 0.7 3 3
Total: 1 3 0.7 3 4
When we increase the concurrency, we see some lags (command shortened for readability):
sudo ab -n 500 -c 50 http://testserver/img/de.png
Results with concurrency 50:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 1
Processing: 2 35 101.6 12 614
Waiting: 2 35 101.6 12 614
Total: 3 36 101.7 12 615
So most requests are pretty fast, but a few of them are quite slow.
We dumped the whole network traffic with tcpdump and saw some strange retransmissions.
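For anyone who wants to reproduce this, a capture along these lines (the interface name, port and file name are placeholders, not our exact command) produces a dump you can open in Wireshark:

# capture full packets on the HTTP port and write them to a file for later analysis
# (eth0 and port 80 are assumptions; adjust to your setup)
sudo tcpdump -i eth0 -s 0 -w imageserver.pcap port 80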
(Screenshot of the packet capture: http://vygen.de/screenshot1.png)
This dump was taken on the imageserver!
So you can see that the initial packet (No. 306) containing the GET request arrives on the imageserver, but it seems the packet gets lost after tcpdump has logged it. It looks to me as if this packet never reaches the Tomcat image server.
The retransmission is triggered by the requesting server 200 ms later, and everything runs fine afterwards.
Do you know any reason why a packet can get lost after it has been received?
Our machines are both:
- Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
- 8 GB RAM
- Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
- Debian version 5.0.5
So we do not have any problems concerning memory or CPU load.
We had some problems with our NIC controller a while ago. We worked around them by switching drivers; we now use r8168 instead of r8169.
But we saw the same lost packets with an Intel NIC as well - Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05).
So we see the same problem on identical machines but with different Ethernet cards.
Until now I thought packet loss would only happen on the wire between the servers, for example when a packet gets corrupted.
We really want to know what could cause packets to be lost after tcpdump has already logged them.
Your help is very much appreciated.
We found the root cause of this: we had an acceptCount of 25 in our Tomcat server.xml.
acceptCount is documented (roughly) as the maximum queue length for incoming connection requests when all possible request processing threads are in use; any request received when the queue is full will be refused.
But this is not the whole story about acceptCount. In short: acceptCount is the backlog parameter passed when the listening socket is opened. So this value matters for the listen backlog even when not all threads are busy; it matters whenever requests come in faster than Tomcat can accept them and hand them off to waiting threads. The default acceptCount is 100, which is still a small value for absorbing a sudden peak in requests.
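For illustration, the relevant connector in server.xml would look roughly like this; port, protocol, maxThreads and connectionTimeout are placeholder values here, acceptCount is the setting this answer is about:

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="200"
           connectionTimeout="20000"
           acceptCount="1000" />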
We checked the same thing with Apache and nginx and saw the same strange packet loss, just at higher concurrency values. The corresponding setting in Apache is ListenBacklog, which defaults to 511.
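For comparison, the equivalent settings look roughly like this (the numbers are examples, not recommendations):

# Apache (httpd.conf)
ListenBacklog 1024

# nginx (the backlog is set per listen directive)
listen 80 backlog=1024;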
BUT: on Debian (and other Linux-based systems) the default maximum value for the backlog parameter is 128.
So whatever you put into acceptCount or ListenBacklog, it will be capped at 128 until you raise net.core.somaxconn.
For a very busy webserver 128 is not enough. You should raise it to something like 500, 1000 or 3000, depending on your needs.
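To check and raise the kernel limit (1000 is simply the value we chose; pick what fits your load):

# show the current kernel limit for the listen backlog
sysctl net.core.somaxconn
# raise it for the running system
sudo sysctl -w net.core.somaxconn=1000
# make it persistent across reboots
echo "net.core.somaxconn = 1000" | sudo tee -a /etc/sysctl.conf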
After setting acceptCount to 1000 and net.core.somaxconn to 1000, we no longer saw those dropped packets. (Now we have a bottleneck somewhere else, but that is another story...)
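A quick way to see whether the listen queue is still overflowing is to look at the kernel's TCP counters; if these numbers keep growing during a test, the backlog is still too small (the exact wording of the counters varies between kernel versions):

netstat -s | grep -i listen
#     1234 times the listen queue of a socket overflowed
#     1234 SYNs to LISTEN sockets dropped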
200 ms is the minimum TCP retransmission timeout (RTO) on Linux. The RTO algorithm is described in RFC 2988, although that 200 ms minimum is not defined there.
So... something is being delayed or lost such that the RTO is being hit. Perhaps the packet was only delayed and the RTO fired anyway, and Wireshark smoothed that over during packet dissection / rendering? You should investigate the trace in more detail.
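One way to do that is to let Wireshark's TCP analysis flag the suspicious segments: the display filter tcp.analysis.retransmission works in the GUI, or on the command line something like the following (capture.pcap is a placeholder for your trace file; the option is -R on older tshark versions, -Y on newer ones):

tshark -r capture.pcap -R "tcp.analysis.retransmission"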
Can you provide a larger image of your packet capture?