The Bacula backups of one of my servers (“A”), which is connected via IPSec (Strongswan on Debian testing) to a storage daemon (“B”), fail to finish about 95% of the time. What apparently happens is:
- Bacula opens a TCP connection to the storage daemon's VPN IP. (A → B)
- Since the kernel setting net.ipv4.ip_no_pmtu_disc=0 is set by default, the IP Don't Fragment bit is set in the plaintext packet.
- When routing the packet into the IPSec tunnel, the DF bit of the payload is copied to the IP header of the ESP packet.
- After some time (often around 20 minutes) and up to several gigabytes of data sent, a packet slightly larger than the previous ESP packets is sent. (A → B)
- As the storage daemon interface has a lower MTU than the one of the sending host, a router along the way sends an ICMP type 3, code 4 (Fragmentation Needed and Don't Fragment was Set) error to the host. (some router → A)
- The connection stalls and, for some reason, host A floods B with ~100 empty duplicate ACKs (within ~20 ms).
(The ICMP packets are reaching host A and there are no iptables rules in place that block ICMP.)
Possible reasons why this happens, that I can think of:
- Kernel bug (Debian 3.13.7-1)
- Linux' IPSec implementation intentionally ignores the PMTU message as a security measure since it is unprotected and would affect an existing SA. (seems to be valid behavior according to RFC 4301 8.2.1)
- Something to do with PMTU aging (RFC 4301 8.2.2)
What is the best way to fix this, without disabling PMTU discovery globally or lowering the interface MTU? Maybe clear the DF bit somehow like FreeBSD does with ipsec.dfbit=0?
You could try creating an iptables rule to set the TCP MSS for the VPN-destined traffic to a lower value. But without a packet capture it's difficult to guess what's going on.
If PMTU discovery fails in a VPN scenario, this is typically caused by a problem with the public IP addresses of the gateways or routers in between, or by filtered ICMP messages. MSS clamping is only an ugly workaround.
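For reference, the MSS rule suggested above could look roughly like the sketch below. Everything in it is an assumption rather than something taken from the question: the peer address 192.0.2.10, the path MTU of 1400, the 80-byte allowance for ESP overhead, and the helper names mss_for_tunnel and print_mss_rules are all placeholders. The script only prints the rules, so nothing is changed until the printed lines are run as root.

```shell
#!/bin/sh
# Sketch only: all values are placeholders, not taken from the question.
PEER=192.0.2.10     # hypothetical VPN IP of storage daemon B
PATH_MTU=1400       # assumed smallest MTU on the path between A and B
ESP_OVERHEAD=80     # assumed generous allowance for ESP/tunnel headers and padding

# MSS = path MTU - ESP allowance - 20 (inner IPv4 header) - 20 (TCP header)
mss_for_tunnel() {
    echo $(( $1 - $2 - 40 ))
}

# Print (do not apply) the rules. TCPMSS is only valid in the mangle table.
print_mss_rules() {
    mss=$(mss_for_tunnel "$PATH_MTU" "$ESP_OVERHEAD")
    # SYNs sent by A: limits the segment size B will send to A.
    echo "iptables -t mangle -A OUTPUT -d $PEER -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $mss"
    # SYN/SYN-ACKs received from B: limits the segment size A will send to B.
    echo "iptables -t mangle -A INPUT -s $PEER -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $mss"
}

print_mss_rules
```

Since the backup data flows from A to B, it is the MSS advertised by B that bounds A's segments, which is why the sketch also rewrites incoming SYN packets. The rules only affect connections established after they are added, and the right MSS value depends on the cipher and encapsulation in use, so treat the numbers as a starting point to verify with a packet capture.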