I have an SSL connection to a server (owner-api.teslamotors.com) that hangs with wget, curl or openssl s_client. I am showing the curl version as it gives the most debug messages:
# curl -iv https://owner-api.teslamotors.com
* Trying 18.203.70.200:443...
* Connected to owner-api.teslamotors.com (18.203.70.200) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
There it hangs after the ClientHello. The TCP connection establishes successfully (also confirmed with telnet/nc). Every other network connection, including every other SSL connection I have tried, works. Except owner-api.teslamotors.com:443.
I found this posting talking about MTU and it sounded far-fetched. But I reduced the server's MTU and it worked! It works with any MTU <= 1420.
The server connects using Ethernet (MTU 1500) to a Mikrotik router and from there the connection goes through a WireGuard tunnel (MTU 1420). I am aware that this may not be optimal as any IP packet from the server >1420 will need to be fragmented. However, this is agnostic of any L4 protocol. SSL over TCP should not care about fragmentation and MTU. Yet, this host does.
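One way to probe that 1420-byte limit (a sketch; the MTU value is from this question, the hostname from the question, adjust for your setup) is to ping with the don't-fragment bit set, choosing the ICMP payload size so the full IPv4 packet lands exactly on a given MTU:

```shell
# Payload that makes the whole IPv4 packet exactly MTU bytes:
# MTU minus 20 (IPv4 header) minus 8 (ICMP header).
mtu=1420
payload=$((mtu - 20 - 8))
echo "$payload"   # 1392

# A 1392-byte payload should pass; 1393 should be dropped or
# trigger "Frag needed" if PMTUD works on this path:
#   ping -M do -s 1392 owner-api.teslamotors.com
#   ping -M do -s 1393 owner-api.teslamotors.com
```

If the larger probe silently disappears instead of producing an ICMP error, that already hints at a PMTUD problem on the path.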
I ran a packet capture on the Mikrotik box and the traffic does not list anything abnormal to me:
The capture shows the typical TCP handshake (packets 1-3), then the ClientHello (4-5) and ServerHello (6-7). No packet size comes close to the MTU, and there are no ICMP messages that would indicate issues with fragmentation etc.
Per comment, here is the tracepath output:
# tracepath owner-api.teslamotors.com
1?: [LOCALHOST] pmtu 1500
1: XX.XX.56.210 0.410ms
1: XX.XX.56.210 0.198ms
2: XX.XX.56.210 0.138ms pmtu 1420
2: XX.XX.56.185 151.394ms
3: no reply
4: 100.100.200.1 157.220ms
5: 10.75.0.193 154.068ms
6: 10.75.2.53 161.950ms asymm 7
7: decix1.amazon.com 152.107ms asymm 8
8: decix2.amazon.com 153.068ms
9: no reply
[...]
I am really lost what the heck is going on here.
Why does this one SSL connection fail?
The core of the problem is that the server doesn't know such packets need to be fragmented.
There are large packets, such as the TLS Certificate message from the server4, but your capture is not seeing them because they are larger than your MTU, so they never arrive at your end. That is literally the problem; if those packets reached your network interface (such that they became visible in a packet capture), then the connection wouldn't hang.

The capture needs to be done on the "upstream" end of the tunnel, specifically on the ingress interface that is one step before the 'low MTU' interface. So if the path is "internet → server eth0 ⇒ server wg0 → client wg-foo ⇒ client ether1", then the large packets will be visible on "server eth0" but won't fit into "server wg0". Capturing on wg0 would therefore give you nothing, but capturing on eth0 would likely show a series of oversized packets from the TLS handshake.
(Note that hardware receive offload might give confusing results, as your Ethernet NIC might coalesce segments into one super-packet, e.g. when capturing on the end-host itself. If you see packets over 2kB in size, you may need to
ethtool -K eth0 gso off gro off
for the duration of the capture.)

During the TCP handshake, the client (both peers, really) declares a TCP MSS, the maximum TCP segment size it can receive. Since the client usually has 'infinite' memory nowadays1 and is not limited to tiny segments, it offers the largest MSS it calculates as optimal for the MTU it knows, in order to avoid the need for IP-level fragmentation.
For example, if your Ethernet interface's MTU is 1500, then your OS might offer an MSS of 1460, which exactly fits within the IP payload (assuming an IPv4 header of 20 bytes and a TCP header of another 20 bytes, in the simplest case).
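The same arithmetic applied to the two MTUs in this question (plain IPv4 + TCP, no header options):

```shell
# MSS = MTU minus IPv4 header (20) minus TCP header (20)
mss_eth=$((1500 - 20 - 20))   # Ethernet link
mss_wg=$((1420 - 20 - 20))    # WireGuard tunnel from the question
echo "$mss_eth $mss_wg"       # 1460 1380
```

So after lowering the interface MTU to 1420, the client advertises an MSS of 1380 instead of 1460, and every segment the server sends back fits through the tunnel.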
So reducing the MTU of the client's network interface will lead to it declaring a smaller acceptable TCP segment size upfront, which causes the server to always send smaller IP packets (i.e. staying below the limit at which fragmentation would become required), just as if you had reduced the server's MTU.
With the default 1500 MTU, meanwhile, the server will send large segments in large IP packets, until it receives an ICMP "Fragmentation needed" from your ISP's gateway (the one that has the low-MTU link towards you and is unable to forward those packets to you); then the server will note the new PMTU towards you and will start sending those segments fragmented at IP level.5
But if any firewall prevents3 that ICMP error from reaching the server, this won't happen and the server will forever try sending that TCP segment in the same large IP packet. (Or, if the server is behind a certain type2 of firewall which reassembles and re-fragments all IP packets going through it, then it might be fragmenting the packet correctly but the firewall could be undoing all its work.)
Gateways, such as Linux with nftables/iptables, often have a feature to patch the advertised MSS of TCP handshakes passing through them so that it fits the MTU the gateway knows about, e.g. when the client is on a 1500-byte MTU Ethernet but the gateway is about to forward the packet through a 1420-byte MTU PPPoE tunnel.
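A minimal sketch of such MSS clamping with nftables (the `inet filter forward` chain is an assumption about your ruleset, not from the question; requires root):

```shell
# Rewrite the MSS option of forwarded SYN packets down to whatever
# fits the MTU of the outgoing route ("rt mtu").
nft add rule inet filter forward \
    tcp flags syn tcp option maxseg size set rt mtu
```

On a Mikrotik router the equivalent knob is the `change-mss` mangle action; either way the effect is the same as lowering the end host's MTU, but applied transparently at the gateway.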
If my understanding is correct, TCP has to care about MTU, because relying on IP fragmentation reduces the efficiency of TCP retransmissions – if even a single fragment is 'lost' then the entire IP packet is 'lost' and none of it gets delivered to the upper layer protocol.
For example, if TCP sent a 64k segment that was fragmented into 45 IP datagrams and one of them got lost, then all of them would need to be retransmitted after the ICMP "Reassembly time exceeded". (This is assuming fragmentation works at all, which as you see sometimes doesn't.3)
Whereas with the same 64k of data divided into TCP segments that fit within the IP MTU, the other 44 IP packets would still be delivered to the recipient's TCP layer and SACK'ed and only the lost one would have to be retransmitted (which I think might even happen ~immediately after the server receives a SACK that indicates a hole, instead of a long reassembly timeout).
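The fragment count above can be sanity-checked (assuming roughly 1480 bytes of IP payload per 1500-byte IPv4 packet):

```shell
# A 64 KiB TCP segment chopped into 1480-byte IP fragments:
total=65536
per_fragment=1480
fragments=$(( (total + per_fragment - 1) / per_fragment ))  # ceiling division
echo "$fragments"   # 45

# Lose any 1 of those fragments: all 45 must be retransmitted.
# With MTU-sized TCP segments instead, only the 1 lost segment is resent.
```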
1 Or so most developers assume.
2 Such as Untangle in its default "brouting" mode.
3 Also known as a "PMTUD black hole". See e.g. Cloudflare blog posts #1, #2 and #3 for one situation where it happens for reasons other than a sysadmin blanket-blocking ICMP.
4 The 'Certificate' messages were visible in the clear with TLSv1.2, but are encrypted with TLSv1.3, so a capture will only see them as 'Application Data'.
5 I don't actually know whether it fragments the same segments at IP level or whether it also reduces its TCP MSS for that connection. It might actually be the latter.
As explained in the other answer, the reason it hangs is because the server reply is bigger than the MTU and the ICMP error is not reaching the server.
The reason it is only happening with this server is that it is the only one sending that big a reply at this point.
Here is what I see if I curl https://owner-api.teslamotors.com, with 18.203.70.200 being the IP address to which owner-api.teslamotors.com resolved (after three CNAMEs, ending up at an AWS host; you will probably receive a different one) and 203.0.113.16 as the client.
Note that the Server Hello is a single TCP packet of 2958 bytes. That's clearly not going to fit the MTU.
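A back-of-envelope check using the 2958-byte figure from the capture (MSS derived as MTU minus 40 bytes of IPv4+TCP headers):

```shell
reply=2958
for mtu in 1500 1420; do
    mss=$((mtu - 40))
    segments=$(( (reply + mss - 1) / mss ))   # ceiling division
    echo "MTU $mtu -> MSS $mss -> $segments segments"
done
```

Either way the reply needs 3 segments; the difference is that with MSS 1460 each full segment becomes a 1500-byte IP packet that cannot pass the 1420-byte tunnel, while 1380-byte segments fit.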
Wireshark identifies in it a
TLSv1.3 Record Layer: Handshake Protocol: Server Hello
of 122 bytes, followed by a
TLSv1.3 Record Layer: Change Cipher Spec Protocol: Change Cipher Spec
of 6 bytes, which doesn't really reveal what's going on.

Setting
--tls-max 1.2
we get more readable data: we still get the 2958-byte Server Hello, but now it isn't encrypted, so we can 'read' that it contains a certificate.
The packet with number 4478 is actually reassembled from two TCP packets, beginning in the previous one, and its contents show the server's certificate,
followed by a
TLSv1.2 Record Layer: Handshake Protocol: Server Hello Done
which is just 9 bytes.

So, the reason it is failing only with this server is that it tries to send a single TCP packet of nearly 3000 bytes (embedding a certificate of 3010 bytes), which isn't happening on the other sites you have tried (there will be more that fail, though: don't drop those ICMP errors, and use a matching MTU if possible).