We host our web-service on a dedicated server. During high-load period, the server returns a time-out error instead of a page very frequently. We have around 170,000 daily requests. However, server has a lot of free memory and the CPU is not loaded at the moment.
I cannot understand why the server works poor.
I have profiled the problem case with tcpdump utility. These are the good and bad sessions traced by tcpdump. The request is the same in both experiments. Good - server returns response. Bad - no response, time-out error.
Do you see why the problem happens from these data? How can I move further to get closer to the source of the error?
I've replaced my real ip address with 123.45.67.890
---- Bad ----
12:23:36.366292 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:23:39.362394 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:23:45.365567 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,nop,sackOK>
--------
---- Good ----
12:27:07.632229 IP 123.45.67.890.63914 > myserver.superbservers.com.www: S 3581365570:3581365570(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:27:10.620946 IP 123.45.67.890.63914 > myserver.superbservers.com.www: S 3581365570:3581365570(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:27:10.620969 IP myserver.superbservers.com.www > 123.45.67.890.63914: S 2654770980:2654770980(0) ack 3581365571 win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 6>
12:27:10.838747 IP 123.45.67.890.63914 > myserver.superbservers.com.www: . ack 1 win 4380
12:27:10.957143 IP 123.45.67.890.63914 > myserver.superbservers.com.www: P 1:213(212) ack 1 win 4380
12:27:10.957152 IP myserver.superbservers.com.www > 123.45.67.890.63914: . ack 213 win 108
12:27:10.965543 IP myserver.superbservers.com.www > 123.45.67.890.63914: P 1:630(629) ack 213 win 108
12:27:10.965621 IP myserver.superbservers.com.www > 123.45.67.890.63914: F 630:630(0) ack 213 win 108
12:27:11.183540 IP 123.45.67.890.63914 > myserver.superbservers.com.www: . ack 631 win 4222
12:27:11.185657 IP 123.45.67.890.63914 > myserver.superbservers.com.www: F 213:213(0) ack 631 win 4222
12:27:11.185663 IP myserver.superbservers.com.www > 123.45.67.890.63914: . ack 214 win 108
--------
Details on the service.
This is a weather reporting service. It is written in Perl, backed by MySQL. The script uses several modules (from CPAN and our own).
The code is relatively simple. The script downloads the weather from another server, converts data format and returns XML response. The weather is cached in MyISAM DB. There is a world locations data-base (INNODB) that can also be requested via the script.
These are the metrics taken during the high-load period.
Average traffic: 2.5MBit/sec
Average packets number: 3300 packets/sec
Hoster: SuperbHosting
OS: Ubuntu
Server parameters: E6300 CONROE 1.86GHZ 2 X 1MB CACHE 1066 1GB DDR2 667MHZ
This is a link to apache configuration file we use http://repkin5.snow.prohosting.com/apache.txt
This is server-status report at the moment of high-load http://repkin5.snow.prohosting.com/server-status.htm There are only 10 Child Servers running out of 120, so enough space for new requests.
Top program snapshot during high-load period.
------
top - 13:21:29 up 15 days, 18:36, 1 user, load average: 0.18, 0.19, 0.21
Tasks: 137 total, 1 running, 136 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.8%us, 1.2%sy, 0.0%ni, 92.8%id, 0.7%wa, 0.0%hi, 3.5%si, 0.0%st
Mem: 1033904k total, 590620k used, 443284k free, 6892k buffers
Swap: 3028212k total, 82556k used, 2945656k free, 64156k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4252 mysql 20 0 162m 48m 3352 S 1 4.8 279:01.27 mysqld
14503 www-data 20 0 43280 14m 3824 S 1 1.4 0:00.16 apache2
14577 www-data 20 0 43012 13m 3500 S 1 1.4 0:00.06 apache2
14401 www-data 20 0 45076 17m 4340 S 0 1.8 0:00.46 apache2
14414 www-data 20 0 45516 18m 4344 S 0 1.8 0:00.47 apache2
14420 www-data 20 0 45624 18m 4372 S 0 1.8 0:00.61 apache2
14421 www-data 20 0 45488 18m 4352 S 0 1.8 0:00.42 apache2
14496 www-data 20 0 44820 17m 4328 S 0 1.7 0:00.18 apache2
14510 www-data 20 0 45216 17m 4300 S 0 1.8 0:00.62 apache2
1 root 20 0 2844 456 404 S 0 0.0 0:05.24 init
2 root 15 -5 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root RT -5 0 0 0 S 0 0.0 0:00.24 migration/0
4 root 15 -5 0 0 0 S 0 0.0 32:28.85 ksoftirqd/0
5 root RT -5 0 0 0 S 0 0.0 0:00.77 watchdog/0
6 root RT -5 0 0 0 S 0 0.0 0:00.15 migration/1
7 root 15 -5 0 0 0 S 0 0.0 0:03.07 ksoftirqd/1
8 root RT -5 0 0 0 S 0 0.0 0:00.63 watchdog/1
-----
Your MySQL server seems to be working hard at times (but not in your screenshot), because of the high CPU time. Check which queries are run by issuing
SHOW PROCCESSLIST;
in the MySQL prompt.Also try to activate the server-status page in Apache. With
ExtendedStatus On
it will give you information about which pages are loading at that time and you can see how many open "slots" you have left in Apache.The default in Apache mpm-prefork is to have about 250 processes at most, meaning you can handle 250 simultaneous file downloads. This can cause a lot of trouble if you have many visitors, especially if you have
KeepAlive
activated. You will see in your server-status page if this is your problem.There are a whole host (<--pun) of reasons why Apache might be unresponsive. As a starting point I'd look at server-status assuming you have mod_status enabled and see how many processes are running and whether they look busy or not. This page is a good first place to look through things you could try if it turns out that you have child process issues.
The top screenshot shows your MySQL server is not working hard at all! (Sorry Emil - but you need to read that more carefully or learn some more about top).
And there's only 8 apache processes listed.
And the load average is 0.2 with 93% cpu idle - this machine is sitting doing nothing.
Half the config is missing from the link posted, but it looks like its pre-forked.
My first guess is that the MaxServers is set to some wacky value like 10 (depending on what your application looks like, it should be in the region of 100-200).
You might also want to check for stupid stuff like reverse lookups on incoming connections or mod_ident installed (OK these aren't always stupid - but should be used with extreme caustion).
Certainly installing mod_status and enabling the extended status will give a much better picture of what's going on here.
C.
Karl B is right apache status page will help you a lot. Check if the request is not logged in errorlog. It could be that the number of active request has reached the maximum number specified in apache configuration file. See if MaxClients is bigger than:
Can you try flood ping (ping -f) the server? I would guess there is some hardware problem regarding the network connection as you server does not seem to answer the SYN packets fast enough.