I am right now getting additional grey hair fighting a phenomenon concerning packet loss between machines on the Internet.
Check the diagram below. Note that whenever I use "SSH" I could use "HTTPS"; the same phenomenon occurs for that protocol.
An SSH server running Fedora 22 is on "Site A" (wine red). I never had any connection problems till "recently".
SSH connections to "Site A" from Amazon EC2 machines running Fedora 22 or Fedora 23 work perfectly well (hosts shown in green inside the "Amazon EC2" box).
SSH connections to "Site A" from "Site B", which is on the same AS, do not work from any Fedora system I tested (orange boxes). However, they do work from a Windows 7 system using PuTTY. The same (dual-boot) hardware is involved in both cases. "Site B" also has a firewall, but that does not seem to play any role: I have tried to set up the connection directly from the FritzBox router and it still didn't work for Fedora but worked for Windows.
How does the problem manifest itself:
When you connect using SSH, there is an initial packet exchange going on (as shown by tcpdump). However, after 20 packets or so, the outgoing packets seem to not go anywhere anymore; no acknowledgements come back from Site A. You never get to the password prompt. A CTRL-C properly resets the connection, after which Linux still tries to send the packets that were never ACKed for a bit.
I suspect there is some problem at my ISP. In particular, I think the ISP performs some dubious magic in order to implement the "fixed IP address" at Site B, which is the only thing that changed "recently".
However, I can't understand what would account for the fact that an SSH connection works from Windows but not from Linux under the same conditions, network-wise. What should I be looking for?
Your packet trace shows:
Note that it's a packet with a length of 1900 bytes and the don't-fragment option set. Typical MTUs tend to be between 1400 and 1500 bytes.
You're probably getting ICMP "packet too big" messages back, but you're dropping all inbound ICMP traffic at the site A firewall.
To test for this you'd have to do the packet trace on your firewall for icmp and tcp 22.
Make sure you permit ICMP packet too big messages inbound at site A.
Alternatively you could try setting the MTU on your Linux boxes at Site A to something under the size of your network MTU. I am hazarding a guess that on Fedora you have jumbo packets enabled but on Windows you do not.
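To check the jumbo-packet guess before changing anything, the configured MTU of each interface can be read from sysfs on Linux. This is only a sketch: `is_jumbo` is a made-up helper name, and it treats anything above the standard Ethernet 1500 bytes as "jumbo".

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given MTU is above standard Ethernet (1500 bytes).
is_jumbo() { [ "$1" -gt 1500 ]; }

# Print every interface with its configured MTU (Linux sysfs).
# Note: the loopback interface usually reports a very large MTU; ignore it.
for f in /sys/class/net/*/mtu; do
    [ -r "$f" ] || continue          # skip if the glob did not match anything
    iface=${f%/mtu}                  # strip the trailing "/mtu"
    iface=${iface##*/}               # keep only the interface name
    mtu=$(cat "$f")
    if is_jumbo "$mtu"; then note="(jumbo)"; else note=""; fi
    echo "$iface $mtu $note"
done
```

If an Ethernet interface shows up as jumbo, lowering it (for example with `ip link set dev <iface> mtu 1400`, as root) would be the experiment suggested above.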
After the suggestions of the dear commenters, I have looked to see whether an MTU problem could be the cause.
The following was found when trying to connect from "Site A" to "Site B" from a Fedora system. On a Windows system everything is working perfectly fine: Wireshark indicates that the length of outgoing packets never exceeds 1158 bytes, so the problem is not triggered there.
In brief, if I read this correctly:
It looks like I will have to open a ticket with the ISP (which is POST Telecom Luxembourg btw, in case someone googles for similar problems).
It also suggests a remediation. Force the MTU for traffic to SITE_A to 1000:
Indeed, this fixes the problem.
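The exact command is not reproduced here. One way to pin the MTU for a single destination with iproute2 is a host route with a locked `mtu`. The sketch below only prints such a command; `route_mtu_cmd` is a made-up helper, and both addresses are placeholders (assumptions, not values from this setup).

```bash
#!/usr/bin/env bash
# Placeholders: replace with the real Site A address and your default gateway.
SITE_A_IP=${SITE_A_IP:-203.0.113.10}
GATEWAY_IP=${GATEWAY_IP:-192.168.1.1}

# Hypothetical helper: build an "ip route add" command that locks the MTU
# for one destination host. Arguments: destination, gateway, MTU.
route_mtu_cmd() {
    echo "ip route add $1/32 via $2 mtu lock $3"
}

# Print the command; run the printed line as root to apply it.
route_mtu_cmd "$SITE_A_IP" "$GATEWAY_IP" 1000
```

Printing instead of executing keeps the sketch harmless; the route can be removed again with `ip route del <address>/32`.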
Reference info
Use ping to test MTU behaviour:
ping -c $COUNT -M $MTUDS -s $PPLSZ $HOST_IP
where
COUNT=1: "One ping only"
MTUDS=do: MTU discovery strategy is "prohibit fragmentation, even local one" i.e. set the 'DF' (don't fragment) bit (why is this 'do'? dunno). USE THIS.
MTUDS=want: MTU discovery strategy is "do PMTU discovery, fragment locally when packet size is large" i.e. set the 'DF' bit and fragment locally
MTUDS=dont: MTU discovery strategy is "don't set the 'DF' bit", i.e. fragment as needed
PPLSZ=1464: ICMP ping packet payload size in bytes.
Use tcpdump to monitor all ICMP packets and packets from and to "Site A":
This is a bit hard to read though.
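The capture command itself is not shown here. A filter along these lines would match the description (all ICMP, plus anything to or from Site A). `capture_filter` and `watch_site` are made-up helper names, and `SITE_A_IP` is a placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a pcap filter for "all ICMP, or any traffic
# to/from the given host".
capture_filter() {
    echo "icmp or host $1"
}

# Hypothetical wrapper (needs root): -n disables name resolution,
# -i any listens on all interfaces.
watch_site() {
    tcpdump -n -i any "$(capture_filter "$1")"
}

# Usage: watch_site "$SITE_A_IP"
```

Keeping ICMP unfiltered matters because the "fragmentation needed" messages come from intermediate routers, not from Site A itself.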
Watch what the kernel thinks about the MTU to "Site A".
Note that a lower MTU than the default will get cached with a TTL of 600 seconds after the first failed ping.
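The command for this is not shown here either. On Linux, `ip route get` prints the route the kernel would pick for a destination, including a cached path MTU when one exists. A minimal sketch, assuming the value appears in that output as the token `mtu` followed by a number (`extract_mtu` and `cached_mtu` are made-up helper names):

```bash
#!/usr/bin/env bash
# Read `ip route get` output on stdin and print the MTU value, if any.
# Assumption: it appears as "mtu N" or "mtu lock N".
extract_mtu() {
    grep -oE 'mtu (lock )?[0-9]+' | awk '{ print $NF }' | head -n 1
}

# Hypothetical wrapper: ask the kernel about one destination.
cached_mtu() {
    ip route get "$1" | extract_mtu
}

# Usage: cached_mtu "$SITE_A_IP"
# Prints nothing when no lowered MTU is cached for that destination.
```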
Scenario
Suppose the maximum IP packet size in bytes (i.e. the size of the Ethernet payload) is 1492 (this is the case on Amazon EC2). An interesting ping payload size is then 1465, because the 28 bytes used for the IP and ICMP headers (20 + 8) bring the packet to 1493, one byte past the maximum.
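The arithmetic can be spelled out in a few lines of shell. The header sizes assume plain IPv4 without options (20 bytes) and an ICMP echo header (8 bytes); the function names are made up for this sketch.

```bash
#!/usr/bin/env bash
MTU=1492        # maximum IP packet size from the scenario
IP_HDR=20       # IPv4 header, assuming no options
ICMP_HDR=8      # ICMP echo header

# Largest ping payload (-s value) that still fits into one packet of the given MTU.
max_payload() { echo $(( $1 - IP_HDR - ICMP_HDR )); }

# Total IP packet size produced by a given ping payload.
packet_size() { echo $(( $1 + IP_HDR + ICMP_HDR )); }

echo "largest payload that fits: $(max_payload "$MTU")"   # 1464
echo "packet size for -s 1465:   $(packet_size 1465)"     # 1493
```

So `-s 1464` is the largest payload that should pass unfragmented, and `-s 1465` is the smallest one that triggers the interesting behaviour.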
In this scenario,
ping -c 1 -M want -s 1465 $HOST_IP
does the following:
On the first ping you get "Frag needed and DF set (mtu = 1492) 100% packet loss". tcpdump shows echo request part 1 (length 1493) going out and a router of the target network sending back an "ICMP unreachable" with the request to fragment down to MTU 1492. A cached entry with MTU=1492 appears in the kernel route cache.
On subsequent pings you get "1 packets transmitted, 1 received". tcpdump shows echo request part 1 (length 1492) and echo request part 2 (length 21, offset 1472) and the corresponding echo reply (length 1493).
Or you can use traceroute:
Packet size 1500. Traceroute tells us that route 10.10.80.7 has MTU 1492
Try with 1492: same problem!
Try with 1491: same problem!
Try with 1490: we get through. There is bound to be some off-by-one error in there.
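Stepping down by hand works, but the search for the largest packet size that gets through can also be automated. The sketch below does a binary search over total IP packet sizes. `find_max_size` and `ping_probe` are made-up names, `HOST_IP` is a placeholder, and the search assumes that the lower bound passes and that every size below a passing size passes too.

```bash
#!/usr/bin/env bash
# Binary-search the largest SIZE in [lo, hi] for which "$probe SIZE" succeeds.
# Assumptions: the probe succeeds at lo, and success is monotone in SIZE.
find_max_size() {
    local probe=$1 lo=$2 hi=$3 mid
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))    # round up so the loop always makes progress
        if "$probe" "$mid"; then
            lo=$mid                     # mid passes: answer is mid or larger
        else
            hi=$(( mid - 1 ))           # mid fails: answer is below mid
        fi
    done
    echo "$lo"
}

# Probe with one ping and DF set. -s takes the ICMP payload size, so the
# 28 header bytes (20 IP + 8 ICMP) are subtracted from the packet size.
ping_probe() {
    ping -c 1 -W 2 -M do -s "$(( $1 - 28 ))" "$HOST_IP" >/dev/null 2>&1
}

# Usage (with HOST_IP set): find_max_size ping_probe 576 1500
```

Note that the kernel caches a lowered MTU after the first failure, so a failing probe may be rejected locally on later attempts; the result is the same, but the failure no longer comes from the remote router.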
Further info of interest: