I have a single-node Kubernetes cluster running on RHEL 7.
I also have a Windows Server 2019 machine running IIS.
Both the Windows and RHEL servers are virtual machines on the same host.
When I run curl from a command prompt on the RHEL host to fetch a 500 KB document from a URL on IIS, the request is "fast" (under 1 second).
When I run the same request from inside a container running in a Kubernetes pod, the request is "slow" (4 seconds or more).
This happens with both Calico (the original) and Weave (now deployed in its place) as the Kubernetes pod network provider.
So far I have run tcpdump inside a container and established that there are a large number of TCP retransmissions and window size updates during the HTTP request.
This looks (to my limited knowledge) like an MTU-related problem. However, reducing the MTU at both the IIS end and within the Weave network has not helped.
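To make "a large number of retransmissions" concrete, here is a minimal sketch of how they could be counted from a capture. It assumes the capture has already been reduced to (sequence number, payload length) pairs for one direction of the connection; the `Segment` record is hypothetical, not a tcpdump output format. It also ignores sequence-number wraparound and cannot tell a genuine retransmission from out-of-order arrival, so treat it as a rough heuristic only.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """One captured TCP segment, one direction only (hypothetical record)."""
    seq: int     # sequence number of the first payload byte
    length: int  # payload length in bytes (0 for pure ACKs / window updates)


def count_retransmissions(segments: Iterable[Segment]) -> int:
    """Count data segments that start below the highest byte already seen.

    Simplified heuristic: no wraparound handling, and reordered segments
    are counted as retransmissions too.
    """
    highest_end = None  # one past the last payload byte seen so far
    retransmissions = 0
    for seg in segments:
        if seg.length == 0:
            continue  # no payload, so it cannot be a data retransmission
        if highest_end is not None and seg.seq < highest_end:
            retransmissions += 1  # overlaps bytes that were already sent once
        end = seg.seq + seg.length
        if highest_end is None or end > highest_end:
            highest_end = end
    return retransmissions
```

For example, the sequence 1000, 2460, 1000 (each 1460 bytes) counts one retransmission, because the third segment resends bytes already covered by the first.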
I am waiting for packet dumps from the customer run at both the IIS end and directly on the RHEL machine, so I can establish where packets are being dropped.
Meanwhile, any ideas are very welcome.
We have cured the problem, though we were never 100% sure of the root cause.
Packet dumps showed jumbo frames (far larger than 1500 bytes) arriving at the K8s box from IIS and then being rejected by Linux with "fragmentation needed", since the Weave MTU was the standard 1376.
The MTU at both ends of the link was 1500, but we think TCP segmentation offload may have been in play (the customer uses VMware, and the problem described in "Mysterious 'fragmentation required' rejections from gateway VM" sounds somewhat related).
We ended up setting a very high MTU (65404) on the Weave network, on the basis that it is all within a single VM, so why not?
This cured the fragmentation, and HTTP requests from inside the containers are now just as fast as those made directly on the K8s host.
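For anyone hitting the same thing, here is a minimal sketch of the arithmetic as we understand it. It assumes plain IPv4 and TCP headers with no options, and it models only the textbook path MTU discovery rule (RFC 1191): a packet larger than the egress MTU with Don't Fragment set is dropped, and an ICMP "destination unreachable / fragmentation needed" message carrying the next-hop MTU goes back to the sender. The function names are illustrative, not part of any Kubernetes or Weave API, and the 9000-byte packet below is a hypothetical jumbo-frame size, not a figure from our dumps.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

ICMP_DEST_UNREACHABLE = 3  # ICMP type
ICMP_FRAG_NEEDED = 4       # ICMP code: "fragmentation needed and DF set"


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IPv4 packet of size mtu."""
    return mtu - IPV4_HEADER - TCP_HEADER


@dataclass
class Verdict:
    forwarded: bool
    # (ICMP type, ICMP code, next-hop MTU) sent back when the packet is rejected
    icmp: Optional[Tuple[int, int, int]] = None


def forward_ipv4(packet_len: int, dont_fragment: bool, egress_mtu: int) -> Verdict:
    """Simplified IPv4 forwarding decision for one packet on one egress link."""
    if packet_len <= egress_mtu:
        return Verdict(forwarded=True)
    if dont_fragment:
        # Too big and may not be fragmented: drop it and report the MTU that fits.
        return Verdict(forwarded=False,
                       icmp=(ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED, egress_mtu))
    # Without DF the packet would be fragmented; this sketch treats it as forwarded.
    return Verdict(forwarded=True)
```

With the standard Weave MTU, `forward_ipv4(9000, True, 1376)` is rejected with `(3, 4, 1376)`, while with 65404 the same packet passes, which is consistent with the fix. `mss_for_mtu` shows why lowering the MTU alone did not help: it only shrinks the segment size each side advertises (1460 at 1500, 1336 at 1376), which does not stop offloading from presenting larger packets.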