We keep getting errors like this on our PHP sites.
Can't connect to MySQL server on '192.168.100.85' (4)
web is the web server 192.168.1.116
mysql is the mysql server 192.168.100.85
So I built a script on the web server to make 10,000 mysql connections in a row and time them. It will "sometimes" produce the error. Most times everything runs fine and there is an average time of 5-10 ms for the call to mysql_connect
Some serious Googling showed that the (4) error is due to the connection being cut off by the timeout.
web# grep mysql.connect_timeout /etc/php.ini
mysql.connect_timeout = 1
So I modified the timeout in the script to 30 to see if it would work. The connection errors went away, but occasionally the connection would take 5 seconds.
After more Googling and some tcpdump I found that occasionally when the MySQL server is doing its reverse lookup of the IP the DNS server would fail to respond. So after 5 seconds it would give up and allow the connection.
I have since added skip-name-resolve to the server. But this did not solve the problem.
Now my test would show slow connections taking 3-4.5 seconds instead of the set 5 with the DNS issue.
So I ran my tests again with tcpdump running on both ends.
web# tcpdump -n -s 65535 -w web3-$(date +"%F_%H-%M-%S").pcap host 192.168.100.85 and port 3306
mysql# tcpdump -n -s 65535 -w master1-$(date +"%F_%H-%M-%S").pcap host 192.168.1.116 and port 3306
Here is the packets from the relevant slow connection.
Packets from web:
No. Time Source Destination Protocol Info
13312 2010-10-13 10:01:01.201965 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=904829062 TSER=0 WS=2
13316 2010-10-13 10:01:04.201577 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=904832062 TSER=0 WS=2
13317 2010-10-13 10:01:04.204837 192.168.100.85 192.168.1.116 TCP mysql > 41560 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1380 TSV=562240314 TSER=904832062 WS=7
13318 2010-10-13 10:01:04.204853 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=904832065 TSER=562240314
13319 2010-10-13 10:01:04.205886 192.168.100.85 192.168.1.116 MySQL Server Greeting proto=10 version=5.0.77-log
13320 2010-10-13 10:01:04.205899 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=61 Win=5840 Len=0 TSV=904832066 TSER=562240316
13321 2010-10-13 10:01:04.205959 192.168.1.116 192.168.100.85 MySQL Login Request userexample
13322 2010-10-13 10:01:04.206800 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=61 Ack=71 Win=5888 Len=0 TSV=562240317 TSER=904832066
13323 2010-10-13 10:01:04.206874 192.168.100.85 192.168.1.116 MySQL Response OK
13324 2010-10-13 10:01:04.208823 192.168.1.116 192.168.100.85 MySQL Request Quit
13325 2010-10-13 10:01:04.208839 192.168.1.116 192.168.100.85 TCP 41560 > mysql [FIN, ACK] Seq=76 Ack=72 Win=5840 Len=0 TSV=904832069 TSER=562240317
13326 2010-10-13 10:01:04.210422 192.168.100.85 192.168.1.116 TCP mysql > 41560 [FIN, ACK] Seq=72 Ack=76 Win=5888 Len=0 TSV=562240320 TSER=904832069
13327 2010-10-13 10:01:04.210437 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=77 Ack=73 Win=5840 Len=0 TSV=904832071 TSER=562240320
13328 2010-10-13 10:01:04.210567 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=73 Ack=77 Win=5888 Len=0 TSV=562240320 TSER=904832069
Packets from mysql:
No. Time Source Destination Protocol Info
13315 2010-10-13 10:01:04.204817 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1380 TSV=904832062 TSER=0 WS=2
13316 2010-10-13 10:01:04.204836 192.168.100.85 192.168.1.116 TCP mysql > 41560 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=562240314 TSER=904832062 WS=7
13317 2010-10-13 10:01:04.206611 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=904832065 TSER=562240314
13318 2010-10-13 10:01:04.206808 192.168.100.85 192.168.1.116 MySQL Server Greeting proto=10 version=5.0.77-log
13319 2010-10-13 10:01:04.207658 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=61 Win=5840 Len=0 TSV=904832066 TSER=562240316
13320 2010-10-13 10:01:04.207815 192.168.1.116 192.168.100.85 MySQL Login Request user=example
13321 2010-10-13 10:01:04.207872 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=61 Ack=71 Win=5888 Len=0 TSV=562240317 TSER=904832066
13322 2010-10-13 10:01:04.207910 192.168.100.85 192.168.1.116 MySQL Response OK
13323 2010-10-13 10:01:04.210817 192.168.1.116 192.168.100.85 MySQL Request Quit
13324 2010-10-13 10:01:04.210849 192.168.100.85 192.168.1.116 TCP mysql > 41560 [FIN, ACK] Seq=72 Ack=76 Win=5888 Len=0 TSV=562240320 TSER=904832069
13325 2010-10-13 10:01:04.211632 192.168.1.116 192.168.100.85 TCP 41560 > mysql [FIN, ACK] Seq=76 Ack=72 Win=5840 Len=0 TSV=904832069 TSER=562240317
13326 2010-10-13 10:01:04.211640 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=73 Ack=77 Win=5888 Len=0 TSV=562240320 TSER=904832069
13327 2010-10-13 10:01:04.213243 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=77 Ack=73 Win=5840 Len=0 TSV=904832071 TSER=562240320
As you can see, web re-sent the initial ACK after 3 seconds of no response. But mysql never even saw that initial packet.
I also tried to run a ping flood to check for dropped packets. If you leave it running long enough you will get dropped packets.
web3# ping -f 192.168.100.85
PING 192.168.100.85 (192.168.100.85) 56(84) bytes of data.
....................................................................
--- 192.168.100.85 ping statistics ---
38253 packets transmitted, 38185 received, 0% packet loss, time 460851ms
rtt min/avg/max/mdev = 0.880/3.430/66.904/8.015 ms, pipe 7, ipg/ewma 12.047/1.378 ms
This issue is very intermittent but is happening all the time. I do understand that simply increasing our timeout would help greatly, but I'd rather have the connects always be fast and never have a 3-4 second delay in serving a page.
I contacted our hosting provider and they say its likely a result of the bad interaction between Nagle's Algorithm and Delayed ACK. http://www.stuartcheshire.org/papers/NagleDelayedAck/
It seems to me like packets are being dropped. Any ideas on a way I could better prove it to the hosting guys? Ping dropping 60 out of 38,000 packets does seem kind of minor. Is this something I should just live with?
Thank you for your time in looking into this!
I wouldn't try it with a ping flood, but I would set up a continous stream and let it run for, say, 2 minutes. That's a more valid test of network "loss" - sometimes ping -f can overrun particular pieces of equipment and give unreliable results.