We do most of our work on colocated servers in a datacenter over SSH, which means we're connected to the boxes almost all day, 5 days a week. Intermittently, we see a lag between typing on the keyboard and having the characters echoed back to us in the shell. I started doing some digging, but I'm having trouble understanding the results, and I'm also looking for next steps. Earlier, I ran a wireshark trace against tcp.dstport == 22, which seems to be where we have the majority of the problems. I did notice a large-ish number of packets (10-20 out of several thousand) that were TCP Retransmissions. I assume this is related to the lag issue we're seeing.
1) mtr to remote host
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 192.168.100.254 76.6% 454 0.5 0.5 0.3 4.7 0.4
2. 10.113.128.1 80.6% 454 17.3 130.8 5.7 6030. 726.7
3. 74.128.19.209 79.5% 454 9.7 25.8 6.7 1270. 133.2
4. 74.128.8.233 80.6% 454 8.5 31.9 6.6 1369. 150.6
5. 4.71.250.1 79.2% 454 1547. 50.5 14.7 1547. 194.1
6. 4.69.138.158 80.4% 454 20.1 29.7 15.4 1003. 104.5
7. 4.69.140.189 74.2% 454 16.2 28.6 15.0 920.0 85.5
8. 4.69.138.4 72.6% 454 17.0 41.2 15.5 821.6 81.7
9. ???
10. 216.26.190.9 79.4% 453 45.2 105.8 24.4 3008. 406.7
11. 216.26.162.162 90.7% 453 28.3 40.2 24.1 556.3 81.7
2) mtr to 192.168.100.254 (happening simultaneously to above mtr)
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 192.168.100.254 0.0% 591 0.8 0.4 0.3 6.9 0.5
First question: why does the top mtr suggest packet loss at 192.168.100.254, when the bottom one does not?
Second question: how can I determine better what might be causing this?
EDIT:
mtr to first host outside our network:
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. edge.networldalliance.local 18.1% 393 0.5 0.5 0.4 1.8 0.2
2. 10.113.128.1 0.0% 393 10.0 10.1 5.5 744.3 37.4
separate mtr to the second hop in the path:
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. edge.networldalliance.local 87.9% 424 0.8 0.7 0.5 1.2 0.1
2. 10.113.128.1 0.0% 424 9.5 9.5 5.2 577.8 27.8
3. 74-128-19-209.dhcp.insightbb.com 0.0% 423 6.5 10.4 6.2 243.9 12.8
separate (again) mtr to the third hop in the path:
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. edge.networldalliance.local 87.2% 440 0.6 0.7 0.4 2.2 0.3
2. 10.113.128.1 0.0% 439 6.4 10.9 5.6 991.8 47.2
3. 74-128-19-209.dhcp.insightbb.com 0.0% 439 8.5 13.3 6.5 744.3 35.6
4. 74.128.8.233 0.0% 439 7.9 23.6 6.3 493.8 47.2
Any suggestions based on this new data? I'm going to see about getting the router / firewall replaced.
Direct Answers
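mtr sends probes (ICMP echo requests) with an incrementing IP TTL until the destination itself answers. Each intermediate hop reveals itself by returning an ICMP TTL-expired (Time Exceeded) message, while the final destination returns an ICMP echo reply. 192.168.100.254 behaves differently in the two cases: it answers TTL-expiration conditions poorly (low success) but answers direct ICMP echo requests reliably (high success).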
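One common way to read mtr output along these lines is to check whether loss reported at a hop carries through to the last hop. The sketch below is only an illustration of that rule of thumb, not part of mtr itself: the function name and the 5-point tolerance are assumptions, and the sample values are copied from the third mtr run in the question's EDIT.

```python
def hops_with_isolated_loss(hops, tolerance=5.0):
    """Return hosts whose reported loss does not carry through to the last hop.

    hops is a list of (host, loss_percent) tuples in path order.
    tolerance is an arbitrary margin in percentage points (an assumption).
    """
    if not hops:
        return []
    final_loss = hops[-1][1]
    # Loss that shows up at an intermediate hop but not at the destination
    # usually reflects how that hop answers TTL-expired probes, not
    # forwarding loss on the path.
    return [host for host, loss in hops[:-1] if loss - final_loss > tolerance]


# Loss% column from the mtr run to 74.128.8.233 in the question.
third_run = [
    ("edge.networldalliance.local", 87.2),
    ("10.113.128.1", 0.0),
    ("74-128-19-209.dhcp.insightbb.com", 0.0),
    ("74.128.8.233", 0.0),
]
suspects = hops_with_isolated_loss(third_run)
```

Here only the first hop is flagged, which fits the picture of the router dropping its own TTL-expired replies while still forwarding traffic normally.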
When you say "causing this", I assume you mean your laggy ssh sessions, instead of the weird mtr results... right? A couple of thoughts...
Run mtr directly to every host in the 11-hop path and see if you can find some interesting symptom starting at one of the hops; based on your first mtr, this may not be much more productive, but it's worth a shot. Also talk to the administrator of 192.168.100.254 to see if you guys can figure out why ICMP TTL-expired replies are getting hosed.
Misc Thoughts
There are three general causes of network problems: packet loss, packet delay (queuing), or packet reordering. However, let's also remember that sometimes host-level issues contribute to your problem [1].
Let's assume for the moment that the 192.168.100.x vlan isn't where your problem is, and your topology looks like this:
If you are not already ssh-ing from a Windows machine to HOST_A, do so [2]. Now record your Windows desktop [3]. When the problem happens again, the recorded video is a very good audit trail for where your problems might be (i.e. either in the network, on hosts, or a combination of both). If you can somehow see ntp time in this video, all the better... this gives you a way to backtrack analysis through syslog as well.
END-NOTES
HOST_A and HOST_B, another for a sniffing session on HOST_A, the last two should be running top or vmstat 5 on HOST_A and HOST_B.
To your second question: perhaps you can let ping run for a few hours against each of the hops you detected, redirecting the output to log files. Then extract the ping times with grep, awk, etc. and plot them (Excel, OO Calc, etc.). You should be able to see at which hop the lag starts.
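As an alternative to grep and awk, the extraction step could be sketched in Python. This assumes the log lines look like typical Linux ping output (for example "64 bytes from 10.113.128.1: icmp_seq=1 ttl=63 time=9.5 ms"); the function names and the 200 ms spike threshold are made up for illustration.

```python
import re
import statistics

# Matches the RTT field in typical Linux ping output, e.g. "time=17.3 ms".
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtts(lines):
    """Return the round-trip times (in ms) found in ping output lines."""
    rtts = []
    for line in lines:
        match = RTT_RE.search(line)
        if match:
            rtts.append(float(match.group(1)))
    return rtts


def summarize(rtts, spike_ms=200.0):
    """Summarize one hop's RTTs; spike_ms is an arbitrary threshold."""
    if not rtts:
        return None
    return {
        "count": len(rtts),
        "median": statistics.median(rtts),
        "worst": max(rtts),
        "spikes": sum(1 for r in rtts if r > spike_ms),
    }


# Hypothetical log excerpt for one hop.
sample = [
    "64 bytes from 10.113.128.1: icmp_seq=1 ttl=63 time=9.5 ms",
    "64 bytes from 10.113.128.1: icmp_seq=2 ttl=63 time=577 ms",
    "Request timeout for icmp_seq 3",
    "64 bytes from 10.113.128.1: icmp_seq=4 ttl=63 time=5.5 ms",
]
summary = summarize(parse_rtts(sample))
```

Running this over one log file per hop and comparing the spike counts side by side should show roughly where along the path the latency starts to climb.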
What kind of Internet connection do you have? Upload saturation is often the culprit when you're dealing with high latency. Configure your router (or new router) to transmit at 85%-90% of the maximum connection speed, and set up fair queuing on it so that ssh packets don't end up at the back of the queue.
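To get a feel for why a saturated upload hurts interactive traffic, here is a rough back-of-the-envelope sketch. The 1000 kbit/s uplink and 64000-byte backlog are hypothetical numbers, not measurements from the question, and the helper names are invented for the example.

```python
def shaped_rate_kbps(link_kbps, fraction=0.9):
    """Rate to configure on the router so the queue builds there, not upstream."""
    return link_kbps * fraction


def queue_delay_ms(queued_bytes, link_kbps):
    """Time in ms a new packet waits behind queued_bytes on a link_kbps uplink."""
    # bits divided by (kilobits per second) gives milliseconds.
    return queued_bytes * 8 / link_kbps


# Hypothetical uplink: 1000 kbit/s with 64000 bytes already queued ahead
# of an ssh keystroke packet.
delay = queue_delay_ms(64000, 1000)
cap_low = shaped_rate_kbps(1000, 0.85)
cap_high = shaped_rate_kbps(1000, 0.90)
```

With these assumed numbers a keystroke would wait about half a second behind the backlog, which is why capping the rate slightly below the line speed and letting a fair queuer reorder small flows can make ssh feel responsive again.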