Ping a Specific Port

Question

David Tonhofer

Asked: 2016-02-11 14:44:29 +0800 CST

Packetloss over Internet for "Linux-Linux" but not for "Windows-Linux" (tl;dr: it's MTU)

I am right now getting additional grey hair fighting a phenomenon concerning packet loss between machines on the Internet.

Check the diagram below. Note that whenever I use "SSH" I could use "HTTPS"; the same phenomenon occurs for that protocol.

An SSH server running Fedora 22 is on "Site A" (wine red). I never had any connection problems until "recently".

SSH connections to "Site A" from Amazon EC2 machines running Fedora 22 or Fedora 23 work perfectly well (hosts shown in green inside the "Amazon EC2" box).

SSH connections to "Site A" from "Site B", which is on the same AS, do not work from any Fedora system I tested (orange boxes). However, they do work from a Windows 7 system using PuTTY. The same (dual-boot) hardware is involved in both cases. "Site B" also has a firewall, but that does not seem to play any role: I have tried to set up the connection directly from the FritzBox router and it still didn't work for Fedora but worked for Windows.

How the problem manifests itself:

When you connect using SSH, there is an initial packet exchange going on (as shown by tcpdump). However, after 20 packets or so, the outgoing packets seem to not go anywhere anymore; no acknowledgements come back from Site A. You never get to the password prompt. A CTRL-C properly resets the connection, after which Linux still tries to send the packets that were never ACKed for a bit.

I suspect there is some problem at my ISP; in particular, I suspect that the ISP performs some dubious magic in order to implement the "fixed IP address" at Site B, which is the only thing that changed "recently".

However, I can't understand what would account for the fact that an SSH connection works from Windows but not from Linux under the same conditions, network-wise. What should I be looking for?

Here (Amazon S3) is the tcpdump of a failing SSH connection

2 Answers

Matthew Ife · Answer 1 · 2016-02-14T14:57:57+08:00

Your packet trace shows:

22:29:22.180852 IP (tos 0x0, ttl 64, id 52989, offset 0, flags [DF], proto TCP (6), length 1900)   
SITE_B_LAN_ADDR.54358 > SITE_A.SSH_PORT: Flags [P.], cksum 0x05c4 (incorrect -> 0xadce), seq 22:1870, ack 22, win 229, options [nop,nop,TS val 4294917498 ecr 71539420], length 1848

Note that this is a 1900-byte packet with the "don't fragment" (DF) flag set. Typical MTUs are between 1400 and 1500 bytes.
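As a rough way to spot such packets in a longer trace, here is a minimal sketch. The function name oversized_df_packets is made up for illustration, and it assumes the `tcpdump -v` IPv4 header line format shown above ("flags [DF] ... length N)"). It prints the IP length of every DF packet larger than a given MTU.

```shell
# Hypothetical helper (not part of tcpdump): reads `tcpdump -v` output on
# stdin and prints the IP total length of each packet that has DF set and
# is larger than the MTU given as $1 (default 1500).
# Assumes the header line format "... flags [DF], proto TCP (6), length 1900)".
oversized_df_packets() {
    mtu="${1:-1500}"
    # Keep only header lines with DF set and reduce them to the length value.
    sed -n 's/.*flags \[DF\].*, length \([0-9][0-9]*\)).*/\1/p' |
    while read -r len; do
        if [ "$len" -gt "$mtu" ]; then
            echo "$len"
        fi
    done
}
```

Possible usage, with capture.pcap standing in for your own capture file: tcpdump -v -n -r capture.pcap | oversized_df_packets 1500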

You're probably getting ICMP "packet too big" messages back ("fragmentation needed" in IPv4 terms), but you're dropping all inbound ICMP traffic at the Site A firewall.

To test for this, run the packet trace on your firewall, capturing ICMP and TCP port 22.

Make sure you permit ICMP "packet too big" messages inbound at Site A.

Alternatively, you could try setting the MTU on your Linux boxes at Site A to something below your network's MTU. I am hazarding a guess that you have jumbo frames enabled on Fedora but not on Windows.

David Tonhofer · Answer 2 · 2016-02-14T15:01:35+08:00

After the suggestions of the dear commenters, I have looked to see whether an MTU problem could be the cause.

The following was found when trying to connect from "Site A" to "Site B" from a Fedora system. On a Windows system everything is working perfectly fine -- Wireshark indicates that the length of outgoing packets never exceeds 1158 bytes, so the problem is not triggered there.

Here (Amazon S3) is the annotated tcpdump of a problem with fragmentation

In brief, if I read this correctly:

There is an initial successful exchange of small packets.
A packet with length 1900 is sent. I suppose the network card will break this up because the MTU for the local network is 1500.
A router in the ISP network with address 10.10.80.7 tells us to "please fragment the packet to MTU 1492".
Wilco! A packet with length 1492 is sent.
A router in the ISP network with address 10.10.80.7 tells us to "please fragment the packet to MTU 1492".
Things go downhill from here.

It looks like I will have to open a ticket with the ISP (which is POST Telecom Luxembourg btw, in case someone googles for similar problems).

This also suggests a remediation. Force the MTU for the route to SITE_A to 1000:

ip route add $SITE_A_IP via $GATEWAY_IP dev $ETHDEV mtu lock 1000

Indeed, this fixes the problem.

Reference info

Use ping to test MTU behaviour:

ping -c $COUNT -M $MTUDS -s $PPLSZ $HOST

where

COUNT=1: "One ping only"
MTUDS=do: MTU discovery strategy is "prohibit fragmentation, even local one" i.e. set the 'DF' (don't fragment) bit (why is this 'do'? dunno). USE THIS.
MTUDS=want: MTU discovery strategy is "do PMTU discovery, fragment locally when packet size is large" i.e. set the 'DF' bit and fragment locally
MTUDS=dont: MTU discovery strategy is "don't set the 'DF' bit", i.e. fragment as needed
PPLSZ=1464: ICMP ping packet payload size in bytes (with the 28 bytes of IP and ICMP headers this gives a 1492-byte IP packet).
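The ping invocation above can be wrapped in a binary search to find the largest packet that gets through unfragmented. This is only a sketch: probe_size and find_path_mtu are made-up names, the 576..1500 default range is an assumption, and it presumes an IPv4 path and the iputils ping flags used above (plus -W for the reply timeout).

```shell
# probe_size HOST IP_SIZE: succeed if an IP packet of IP_SIZE bytes with DF
# set gets an echo reply. 28 = 20-byte IPv4 header + 8-byte ICMP header.
probe_size() {
    ping -c 1 -W 2 -M do -s "$(( $2 - 28 ))" "$1" >/dev/null 2>&1
}

# find_path_mtu HOST [LOW] [HIGH]: binary search for the largest IP packet
# size in [LOW, HIGH] accepted by probe_size. Assumes LOW itself gets through.
find_path_mtu() {
    host="$1"; low="${2:-576}"; high="${3:-1500}"
    while [ "$low" -lt "$high" ]; do
        mid=$(( (low + high + 1) / 2 ))   # round up so the loop terminates
        if probe_size "$host" "$mid"; then
            low="$mid"
        else
            high=$(( mid - 1 ))
        fi
    done
    echo "$low"
}
```

Possible usage: find_path_mtu "$HOST". Lost replies from an overloaded or rate-limited host would make the result too low, so treat the number as a hint rather than a measurement.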

Use tcpdump to monitor all ICMP packets and packets from and to "Site A":

tcpdump -vvv -n -nn icmp or '(' host $SITE_A_IP ')'

This is a bit hard to read though.

Watch what the kernel thinks about the MTU to "Site A".

watch ip route get to $SITE_A_IP

Note that a lower MTU than the default will get cached with a TTL of 600 seconds after the first failed ping.
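To pull just the MTU value out of that output in a script, something like the following might do. It is a sketch under an assumption: that `ip route get` prints the value as "mtu N" (cached entry) or "mtu lock N" (locked route); the function name route_mtu is made up.

```shell
# Hypothetical helper: reads `ip route get` output on stdin and prints the
# first MTU value found, accepting both "mtu 1492" and "mtu lock 1000".
# Prints nothing if no mtu attribute is present.
route_mtu() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "mtu") {
                v = $(i + 1)
                if (v == "lock") v = $(i + 2)
                print v
                exit
            }
    }'
}
```

Possible usage: ip route get to "$SITE_A_IP" | route_mtu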

Scenario

Suppose the maximum IP packet size in bytes (i.e. the size of the Ethernet payload) is 1492 (this is the case on Amazon EC2). An interesting ping payload size is then 1465, because adding the 28 bytes used for the IP and ICMP headers gives 1493, one byte past the maximum.

Then ping -c 1 -M want -s 1465 $HOST_IP does the following:

On the first ping you get "Frag needed and DF set (mtu = 1492) 100% packet loss". tcpdump shows echo request part 1 (length 1493) going out and a router of the target network sending back an "ICMP unreachable" with the request to fragment down to MTU 1492. A cached entry with MTU=1492 appears in the kernel route cache.

On subsequent pings you get "1 packets transmitted, 1 received". tcpdump shows echo request part 1 (length 1492) and echo request part 2 (length 21, offset 1472) and the corresponding echo reply (length 1493).
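The fragment lengths and offsets seen in tcpdump can be reproduced with a bit of arithmetic. The sketch below assumes a 20-byte IPv4 header without options and an 8-byte ICMP header; the function name icmp_fragments is made up for illustration.

```shell
# icmp_fragments PAYLOAD MTU: print "length L offset O" for each IPv4
# fragment of an ICMP echo request with PAYLOAD bytes of data.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte ICMP header.
icmp_fragments() {
    payload="$1"; mtu="$2"
    remaining=$(( payload + 8 ))        # ICMP header + data to be carried
    max_data=$(( mtu - 20 ))            # room per packet after the IP header
    chunk=$(( max_data / 8 * 8 ))       # non-final fragments carry a multiple of 8
    offset=0
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -le "$max_data" ]; then
            data="$remaining"           # final (or only) fragment
        else
            data="$chunk"
        fi
        echo "length $(( data + 20 )) offset $offset"
        offset=$(( offset + data ))
        remaining=$(( remaining - data ))
    done
}
```

For the scenario above, icmp_fragments 1465 1492 gives one fragment of length 1492 at offset 0 and one of length 21 at offset 1472, matching the tcpdump observation.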

Or you can use traceroute

# traceroute --mtu SITE_A 1500

Packet size 1500. Traceroute tells us that router 10.10.80.7 has MTU 1492.

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1500 byte packets
 1  gateway (192.168.10.1)  0.550 ms  0.536 ms  0.393 ms
 2  192.168.178.1 (192.168.178.1)  1.458 ms  1.485 ms  1.344 ms
 3  10.10.80.7 (10.10.80.7)  4.889 ms F=1492  2.968 ms  4.854 ms
 4  10.10.80.7 (10.10.80.7)  4.955 ms !F-1492  3.559 ms !F-1492  5.022 ms !F-1492

Try with 1492: same problem!

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1492 byte packets
 1  gateway (192.168.10.1)  0.635 ms  0.554 ms  0.483 ms
 2  192.168.178.1 (192.168.178.1)  1.510 ms  1.504 ms  1.311 ms
 3  10.10.80.7 (10.10.80.7)  48.305 ms  17.436 ms  5.496 ms
 4  10.10.80.7 (10.10.80.7)  5.963 ms !F-1492  6.865 ms !F-1492  4.887 ms !F-1492

Try with 1491: same problem!

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1491 byte packets
 1  gateway (192.168.10.1)  0.594 ms  0.650 ms  0.492 ms
 2  192.168.178.1 (192.168.178.1)  1.716 ms  1.782 ms  1.580 ms
 3  10.10.80.7 (10.10.80.7)  7.327 ms  7.385 ms  4.775 ms
 4  10.10.80.7 (10.10.80.7)  5.210 ms !F-1492  5.624 ms !F-1492  4.841 ms !F-1492

Try with 1490: we get through. There is bound to be some off-by-one error in there.

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1490 byte packets
 1  gateway (192.168.10.1)  0.616 ms  0.688 ms  0.484 ms
 2  192.168.178.1 (192.168.178.1)  1.712 ms  1.853 ms  1.611 ms
 3  10.10.80.7 (10.10.80.7)  6.248 ms  7.008 ms  4.995 ms
 4  SITE_A_IP.dyn.luxdsl.pt.lu (SITE_A_IP)  12.441 ms !X  9.641 ms !X  9.576 ms !X
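To compare several such runs, the MTU annotations can be filtered out of the traceroute output. A minimal sketch, assuming the "F=1492" and "!F-1492" annotation style shown in the output above; the function name traceroute_mtu_hints is made up.

```shell
# Hypothetical helper: reads traceroute output on stdin and prints one line
# per hop that carries an MTU annotation such as "F=1492" or "!F-1492".
traceroute_mtu_hints() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^!?F[=-][0-9]+$/) {
                mtu = $i
                sub(/^!?F[=-]/, "", mtu)   # strip the marker, keep the number
                print "hop " $1 " (" $2 ") reports MTU " mtu
                break                      # one line per hop is enough
            }
    }'
}
```

Possible usage: traceroute --mtu SITE_A 1500 | traceroute_mtu_hints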
