We host our web service on a dedicated server. Occasionally (I'd say about 1 request in 20) no response is received from the server, and the browser eventually fails with a time-out error.
An important detail: Apache does not log the request in this case. The server is not under load; there is plenty of free memory and CPU power left.
I captured the problem with the tcpdump utility. Below are a "good" and a "bad" session as traced by tcpdump. The request is the same in both experiments. Good: the server returns a response. Bad: no response, time-out error.
Can you see from this data why the problem happens? How can I get closer to the source of the error?
I've replaced my real IP address with 123.45.67.890.
---- Bad ----
12:23:36.366292 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:23:39.362394 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:23:45.365567 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,nop,sackOK>
--------
---- Good ----
12:27:07.632229 IP 123.45.67.890.63914 > myserver.superbservers.com.www: S 3581365570:3581365570(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:27:10.620946 IP 123.45.67.890.63914 > myserver.superbservers.com.www: S 3581365570:3581365570(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:27:10.620969 IP myserver.superbservers.com.www > 123.45.67.890.63914: S 2654770980:2654770980(0) ack 3581365571 win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 6>
12:27:10.838747 IP 123.45.67.890.63914 > myserver.superbservers.com.www: . ack 1 win 4380
12:27:10.957143 IP 123.45.67.890.63914 > myserver.superbservers.com.www: P 1:213(212) ack 1 win 4380
12:27:10.957152 IP myserver.superbservers.com.www > 123.45.67.890.63914: . ack 213 win 108
12:27:10.965543 IP myserver.superbservers.com.www > 123.45.67.890.63914: P 1:630(629) ack 213 win 108
12:27:10.965621 IP myserver.superbservers.com.www > 123.45.67.890.63914: F 630:630(0) ack 213 win 108
12:27:11.183540 IP 123.45.67.890.63914 > myserver.superbservers.com.www: . ack 631 win 4222
12:27:11.185657 IP 123.45.67.890.63914 > myserver.superbservers.com.www: F 213:213(0) ack 631 win 4222
12:27:11.185663 IP myserver.superbservers.com.www > 123.45.67.890.63914: . ack 214 win 108
--------
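To make the difference between the two traces explicit, here is a minimal Python sketch that groups the tcpdump lines by client endpoint and reports how many SYNs were sent, the gaps between them, and whether the server ever answered with a SYN-ACK. The function name `summarize_handshakes` and the `server_suffix` parameter are my own illustrative choices, and the regular expression only assumes the tcpdump text format shown above. On these traces it should report three unanswered SYNs (about 3 s and 6 s apart) for the bad session, and two SYNs about 3 s apart followed by a SYN-ACK for the good one, so even the good session lost its first SYN.

```python
import re
from collections import defaultdict

# Matches the tcpdump text format used in the traces above, e.g.
# "12:23:36.366292 IP a.b.c.d.61749 > host.www: S 2125316338:2125316338(0) win ..."
LINE_RE = re.compile(
    r"^(?P<ts>\d\d:\d\d:\d\d\.\d+) IP (?P<src>\S+) > (?P<dst>\S+): "
    r"(?P<flags>\S+) (?P<rest>.*)$"
)


def to_seconds(ts):
    """Convert an HH:MM:SS.ffffff timestamp to seconds since midnight."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def summarize_handshakes(lines, server_suffix=".www"):
    """Per client endpoint: SYN count, gaps between SYNs, SYN-ACK seen or not.

    server_suffix is an assumption: it identifies the server side by the
    ".www" port name that tcpdump prints in the traces above.
    """
    sessions = defaultdict(lambda: {"syn_times": [], "syn_ack": False})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("flags") != "S":
            continue  # only handshake packets carrying the SYN flag matter here
        src, dst = match.group("src"), match.group("dst")
        if src.endswith(server_suffix):
            # A SYN sent by the server that also carries "ack" is a SYN-ACK.
            if "ack" in match.group("rest").split():
                sessions[dst]["syn_ack"] = True
        else:
            sessions[src]["syn_times"].append(to_seconds(match.group("ts")))

    report = {}
    for client, session in sessions.items():
        times = session["syn_times"]
        report[client] = {
            "syn_count": len(times),
            "gaps": [round(b - a, 1) for a, b in zip(times, times[1:])],
            "answered": session["syn_ack"],
        }
    return report
```

A session with `answered` set to `False`, or with more than one SYN, means the SYN reached the capture point but the handshake was not completed on the first attempt.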
Details on the service.
This is a weather reporting service. It is written in Perl, backed by MySQL. The script uses several modules (from CPAN and our own).
The code is relatively simple. The script downloads the weather from another server, converts the data format, and returns an XML response. The weather is cached in a MyISAM database. There is also a world locations database (InnoDB) that can be queried via the script.
Hosting: SuperbHosting
OS: Ubuntu
Try using tcpdump or Wireshark to monitor the network traffic. That way you will at least know whether there is a networking issue, i.e. whether the request reaches the machine at all.
Also, by default most browsers limit the number of simultaneous connections to one and the same server (traditionally 2). If your page has JavaScript objects that "forget" to close a connection, the browser may never actually send the request.
Can you try your request using only IP addresses? If so, this may help narrow down the problem.
Do all the failing requests come from the same location? If so, try another location, perhaps a laptop in a Starbucks or something. If it happens from more than one location, using different browsers, on a very simple page without AJAX or complicated JavaScript, that is valuable information.
If using the IP address works reliably, then it is likely DNS. Knowing the domain name in use may help narrow it down.
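One way to run that comparison without a browser is to open plain TCP connections in a loop and count the failures. The sketch below is only an illustration under my own assumptions: the `probe` name, the 20 attempts, the 5 second timeout and the 0.9 second "slow" threshold are all arbitrary choices, and the `connect` parameter exists only so the function can be exercised without a network.

```python
import socket
import time


def probe(host, port=80, attempts=20, timeout=5.0, slow_after=0.9,
          connect=socket.create_connection):
    """Open and close `attempts` TCP connections to host:port.

    Returns (failures, slow, durations). A "slow" connect is one that took
    at least slow_after seconds, which is an assumed hint that the first SYN
    was lost and the handshake only succeeded on a retransmission.
    """
    failures = 0
    durations = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # With a host name, name resolution happens inside this call,
            # so DNS delays and DNS failures are counted as well.
            conn = connect((host, port), timeout)
        except OSError:  # covers timeouts, refusals and resolution errors
            failures += 1
            continue
        conn.close()
        durations.append(time.monotonic() - start)
    slow = sum(1 for d in durations if d >= slow_after)
    return failures, slow, durations
```

Running it once with the server's IP address and once with its host name gives two sets of numbers to compare. If only the host name run shows failures or delays, DNS is the likely suspect; if both do, the problem is more likely in the network path or on the server itself.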
I'd go with Michael Gaff and then put some money on the hosting company. These kinds of traffic problems very easily occur with failing patch panels, NICs, NIC driver issues or bad cabling, amongst a thousand other infrastructure things.
I'm counting on you having tried this from different locations (or having reports from other places with the same problem) and gotten the same result regardless, so that we can rule out a problem at your end, correct?
I'm a hardware freak, so I tend to lean towards hardware failures as the cause of weird software and network issues, and mass destruction in general.
The problem was a large number of open TCP connections; because of this, new connections were occasionally dropped.
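For anyone who wants to check for the same condition, here is a minimal Python sketch that counts TCP connections per state from the text of /proc/net/tcp on Linux. It is an illustration only: the `count_states` name and the optional port filter are my own, it covers IPv4 sockets only (IPv6 sockets are listed in /proc/net/tcp6), and it does not tell you which limit was being hit, only how many connections exist in each state.

```python
from collections import Counter

# State codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, local_port=None):
    """Count sockets per TCP state, optionally only for one local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:0050" (hex address, hex port).
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port is not None and port != local_port:
            continue
        state = fields[3].upper()
        counts[TCP_STATES.get(state, state)] += 1
    return counts
```

Passing it the contents of /proc/net/tcp together with `local_port=80` gives the per-state totals for the web server port; a count that keeps growing between runs would be the thing to look into.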