Our web servers serving static content occasionally show strange 3-second latencies. A typical ApacheBench run (more than 10000 requests, concurrency 1 or 40 with no difference, keepalive off) looks like this:
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        2   10  152.8      3   3015
Processing:     2    8   34.7      3    663
Waiting:        2    8   34.7      3    663
Total:          4   19  157.2      6   3222

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      7
  75%      7
  80%      7
  90%      9
  95%     11
  98%    223
  99%    225
 100%   3222 (longest request)
I have tried many things:
- Apache2 2.2.9 with worker or prefork MPM, no difference (with KeepAliveTimeout 10-15)
- Nginx 0.6.32
- various tcp parameters (net.core.somaxconn=3000, net.ipv4.tcp_sack=0, net.ipv4.tcp_dsack=0)
- putting the files/DocumentRoot on tmpfs
- shorewall on or off (i.e. empty iptables or not)
- AllowOverride None is on for /, so no .htaccess checks (verified with strace)
- the problem persists whether the webservers are accessed directly or through a Foundry load balancer
Kernel is 2.6.32 (Debian Lenny backports), but it occurred with 2.6.26 also. IPv6 is enabled, but not used.
Does the issue look familiar to anyone? Help/suggestions are much appreciated. It sounds a bit like a SYN,ACK packet getting lost or ignored.
Capture this event with tcpdump/Wireshark/tshark. Then open the capture in Wireshark, go to Statistics->TCP stream graph->Time-sequence graph (Stevens).
This gets you a graph of sequence numbers vs. time. If you have a 3 second gap in your connections, you should be able to spot it: there will be no dots for 3 seconds on the x-axis between two dense groupings of dots. Click on the last dot on the left side of the gap. This takes you to the frame just before the gap happens. Usually that's the one packet containing the problem. You might see a zero-window packet, a missing packet, out-of-order delivery, duplicates, etc.
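If you would rather find the gap without clicking through the graph, the same idea can be sketched in a few lines of Python. This assumes you have already exported the per-packet relative timestamps of one TCP stream as plain floats (for example with something like `tshark -r capture.pcap -T fields -e frame.time_relative`; the file name is a placeholder). The function name `find_gaps`, the 2.5 second threshold, and the sample values are all made up for illustration.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], min_gap: float = 2.5) -> List[Tuple[int, float, float]]:
    """Return (index, time, gap_length) for every silence of at least min_gap seconds.

    index is the position of the last packet seen before the gap, i.e. the
    "last dot on the left side of the gap" in the Stevens graph.
    """
    gaps = []
    for i in range(len(timestamps) - 1):
        delta = timestamps[i + 1] - timestamps[i]
        if delta >= min_gap:
            gaps.append((i, timestamps[i], delta))
    return gaps


# Hypothetical timestamps (seconds since start of capture), not real data.
sample = [0.000, 0.002, 0.003, 3.004, 3.006]

for index, when, length in find_gaps(sample):
    print(f"gap of {length:.3f}s after packet #{index} at t={when:.3f}s")
```

The packet at the reported index is the one to inspect in Wireshark.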
Check if your DNS server is slow, and set your Apache log files so that they log by IP, not by domain name. If hostname lookups are enabled (the HostnameLookups directive), every request makes the server do a reverse DNS lookup before it can log.
This can be caused by IO locks in many interesting ways. To start with, try to isolate the problem. Is the problem the server/network, or is it the service? Can you replicate the problem with ping/tcpping?
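If you don't have tcpping handy, a rough stand-in is to time bare TCP connects yourself. The sketch below only opens and closes connections, so no web server code is involved beyond the kernel accepting the connection. The names `time_connect` and `probe`, the 1 second threshold, and the host in the usage comment are assumptions for illustration.

```python
import socket
import time


def time_connect(host: str, port: int, timeout: float = 10.0) -> float:
    """Return the seconds taken to complete a TCP handshake with host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


def probe(host: str, port: int, count: int = 100, slow_threshold: float = 1.0):
    """Connect count times and return (attempt, seconds) for every slow connect."""
    slow = []
    for attempt in range(count):
        elapsed = time_connect(host, port)
        if elapsed >= slow_threshold:
            slow.append((attempt, elapsed))
    return slow


# Usage (hypothetical host name):
#   print(probe("webserver.example", 80, count=10000))
```

If slow connects show up here too, the web server software is probably not the culprit.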
If it's a problem where the whole server hangs for a few seconds:
Are your hard-disks set to spin down on inactivity? If you get a page-fault on a HD that is spun down the system can take seconds to recover. Either way, consider getting rid of swap.
It can be a low-level problem with the network. I have seen similar behaviour, with rare slow connections, when a switch ran out of space in its MAC address table. Do some packet traces and see if you can spot something else on the network that seems related.
It can also be a HW problem with the server, such as a bus that locks up and recovers after a few seconds. Check your logs.
If it seems to be only Apache:
DNS lookups would be a common culprit, but you seem to have that one covered.
Try rolling out a completely different server (like lighttpd) and see if that gets you around the problem. If it does, you can start suspecting something in your Apache configuration.
Sounds like a problem with TCP connection establishment, i.e. a lost SYN,ACK just as you suggest.
3 seconds is the default first timeout for TCP SYN,ACK on Linux. It is unlikely to be application (webserver) related as connection establishment is handled by the kernel.
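To make the timing argument concrete, here is a small sketch of the retransmission schedule. It assumes a 3 second initial timeout that doubles on each retry, as described above; the retry count of 5 and the function names are illustrative assumptions, and newer kernels may use a different initial value. The max Connect time of 3015 ms in the ApacheBench output matches exactly one timeout plus a normal handshake.

```python
def retransmit_offsets(initial_rto: float = 3.0, retries: int = 5):
    """Seconds after the first send at which each retransmission fires,
    assuming the timeout doubles after every attempt."""
    offsets = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto
        offsets.append(elapsed)
        rto *= 2
    return offsets


def retransmits_implied(connect_seconds: float, initial_rto: float = 3.0, retries: int = 5) -> int:
    """How many retransmission timeouts had expired before the connect finished."""
    return sum(1 for t in retransmit_offsets(initial_rto, retries) if connect_seconds >= t)


print(retransmit_offsets())        # [3.0, 9.0, 21.0, 45.0, 93.0]
print(retransmits_implied(0.003))  # 0 (typical connect from the ab run)
print(retransmits_implied(3.015))  # 1 (the slowest connect from the ab run)
```

Connect times clustering just above 3 s (and, more rarely, 9 s) point at a lost handshake packet rather than a slow application.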
Since it affects less than 1% of connections, some things it could be are:
I had this recently on a server and it turned out to be the second one above: misconfigured NIC, which had been forced to the wrong speed and duplex settings. I reset it to autonegotiate with ethtool and haven't looked back.
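For a quick check across many machines without parsing ethtool output, Linux also exposes the negotiated link speed and duplex under `/sys/class/net/<interface>/`. The sketch below reads those files instead of calling ethtool; the function names, the interface name, and the expected speed of 1000 Mb/s are assumptions for illustration, so adjust them to your hardware.

```python
from pathlib import Path


def link_settings(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read speed (Mb/s) and duplex for iface from sysfs; None if unreadable."""
    base = Path(sysfs_root) / iface

    def read(name: str):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    return {"speed_mbps": read("speed"), "duplex": read("duplex")}


def looks_suspicious(settings: dict, expected_speed: str = "1000") -> bool:
    """Flag anything that is not full duplex at the expected speed."""
    return settings["duplex"] != "full" or settings["speed_mbps"] != expected_speed


# Usage (interface name is an assumption):
#   s = link_settings("eth0")
#   print(s, "suspicious" if looks_suspicious(s) else "ok")
```

A half-duplex or unexpectedly slow link is worth comparing against the switch port configuration before resetting anything.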