Between servers windows-in-Finland <-> linux-in-Germany, I am experiencing 100x slower upload than download (windows -> linux is 100x slower than windows <- linux).
Details and existing research
I originally observed this problem with Windows clients across the world, and noticed that I can reproduce it also across controlled datacenter environments.
For reproducing the problem, I'm using the datacenter provider Hetzner, with the Windows
machine being in Finland (dedicated server, Windows Server 2019), uploading to both of:
- Linux Hetzner dedicated Germany: slow
- Linux Hetzner Cloud VM Germany: fast
Both of them are in the same datacenter park and thus both have 37 ms ping from the Windows machine. While the connection between Finland and Germany is usually on Hetzner's private network, it is currently being re-routed via public Internet routes due to the C-LION1 2024 Baltic Sea submarine cable disruption (Hetzner status message about it), so the connection "simulates" using normal public Internet routes and peerings.
I'm measuring with iperf3, windows <- linux:
C:\Users\Administrator\Downloads\iperf3.17.1_64\iperf3.17.1_64>iperf3.exe -c linux-germany-dedicated.example.com
Connecting to host linux-germany-dedicated.example.com, port 5201
[ 5] local 192.0.2.1 port 62234 connected to 192.0.2.2 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 15.8 MBytes 132 Mbits/sec
[ 5] 1.00-2.00 sec 1.88 MBytes 15.7 Mbits/sec
[ 5] 2.00-3.00 sec 1.38 MBytes 11.5 Mbits/sec
[ 5] 3.00-4.00 sec 1.75 MBytes 14.7 Mbits/sec
[ 5] 4.00-5.00 sec 2.25 MBytes 18.9 Mbits/sec
[ 5] 5.00-6.00 sec 2.88 MBytes 24.1 Mbits/sec
[ 5] 6.00-7.00 sec 3.25 MBytes 27.3 Mbits/sec
[ 5] 7.00-8.00 sec 3.38 MBytes 28.3 Mbits/sec
[ 5] 8.00-9.00 sec 2.75 MBytes 23.1 Mbits/sec
[ 5] 9.00-10.00 sec 1.25 MBytes 10.5 Mbits/sec
More iperf3 observations (example invocations are sketched after this list):
- The other direction (adding -R to iperf3) is much faster, at ~900 Mbit/s. (Note that the Linux sides are using BBR congestion control, which likely helps that direction.)
- When uploading with 30 connections (iperf3 with -P 30), the full 1 Gbit/s connection is maxed out, suggesting that the problem is the upload throughput of a single TCP connection.
- When replacing the Windows machine with a Linux one in Finland, both directions max out their 1 Gbit/s connection. This leads me to conclude that the involvement of Windows is at fault.
- Note there is a Microsoft article claiming that iperf3 is not the best tool for high-performance measurements on Windows. This is not relevant for this question because it applies only to >= ~10 Gbit/s connections, and iperf3 between multiple Windows/Linux machines in the same datacenter shows that 1 Gbit/s is easily achievable with iperf3 in both directions.
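For reference, the invocations behind these observations look roughly like this (a sketch; the hostname is the placeholder used above, and the Linux side just runs a plain server):
# Linux side: plain iperf3 server
iperf3 -s
# Windows -> Linux upload (the slow direction):
iperf3.exe -c linux-germany-dedicated.example.com
# Linux -> Windows, i.e. reverse mode (~900 Mbit/s):
iperf3.exe -c linux-germany-dedicated.example.com -R
# 30 parallel upload streams, which together max out the 1 Gbit/s link:
iperf3.exe -c linux-germany-dedicated.example.com -P 30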
In 2021 Dropbox released an article Boosting Dropbox upload speed and improving Windows’ TCP stack that points out Windows's incorrect (incomplete) handling of TCP retransmissions; Microsoft published Algorithmic improvements boost TCP performance on the Internet along with it.
That seems to largely explain it, and Wireguard slow but only for windows upload shows a potential solution, namely changing the number of RSS (Receive Side Scaling) queues to 1:
ethtool -L eth0 combined 1
This changes the queue count from 16 (16 threads on my dedicated Linux server) to 1, and increases the converged iperf3 upload speed from 10.5 to 330 Mbit/s.
That's nice, but it should be 1000 Mbit/s.
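For reference, the change can be inspected with ethtool's channel query (a sketch; eth0 stands for the actual interface name):
# Show current channel configuration; "Combined: 16" before, "Combined: 1" after
ethtool -l eth0
# Reduce to a single combined queue, then re-check
ethtool -L eth0 combined 1
ethtool -l eth0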
Especially odd: testing windows -> linux-Hetzner-Cloud instead of windows -> Hetzner-dedicated, I observe perfect upload speeds:
C:\Users\Administrator\Downloads\iperf3.17.1_64\iperf3.17.1_64>iperf3.exe -c linux-germany-hcloud.example.com
Connecting to host linux-germany-hcloud.example.com, port 5201
[ 5] local 192.0.2.1 port 55615 connected to 192.0.2.3 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 108 MBytes 903 Mbits/sec
[ 5] 1.00-2.00 sec 112 MBytes 942 Mbits/sec
...
[ 5] 9.00-10.00 sec 112 MBytes 942 Mbits/sec
This is odd, because the cloud machine has much lower specs. It has 8 virtual cores, but its ethtool -l output already defaults to Combined: 1 because, being a VM, it does not support RSS at all:
root@linux-germany-hcloud ~ # ethtool -x enp1s0
RX flow hash indirection table for enp1s0 with 1 RX ring(s):
Operation not supported
RSS hash key:
Operation not supported
RSS hash function:
toeplitz: on
xor: off
crc32: off
So somehow the weaker machine lacks the problem. Maybe there's some clever NIC hardware thing going on in the dedicated machine that creates the problem? What could it be?
I already tried disabling TCP Segmentation Offload (ethtool -K eth0 tso off) but that does not affect the results. The feature that caused the problem in the Dropbox article (flow-director-atr) is not available on my NIC, so that can't be it.
Question
What can explain the further 3x bottleneck in upload when comparing the two Linux servers?
How can I just get fast uploads from Windows?
More environment info
- Both Linux machines use the same Linux version 6.6.33 x86_64 and the same sysctls (ensured via NixOS), which are:
  net.core.default_qdisc=fq
  net.core.rmem_max=1073741824
  net.core.wmem_max=1073741824
  net.ipv4.conf.all.forwarding=0
  net.ipv4.conf.net0.proxy_arp=0
  net.ipv4.ping_group_range=0 2147483647
  net.ipv4.tcp_congestion_control=bbr
  net.ipv4.tcp_rmem=4096 87380 1073741824
  net.ipv4.tcp_wmem=4096 87380 1073741824
- Windows Server 2019, Version 1809 (OS Build 17763.6293)
Edit 1
I found that I get 950 Mbit/s upload from Windows to other Hetzner-dedicated machines. The dedicated machines to which the upload is slow all have in common that they have Intel 10 Gbit/s network cards; from lspci:
01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
lsmod | grep ixgbe suggests that the ixgbe driver is used here.
ixgbe is also mentioned in the above Dropbox article. The paper "Why Does Flow Director Cause Packet Reordering?" they link mentions Intel 82599 specifically. I also found this e1000-devel thread where somebody mentions the problem in 2011, but no solution is presented.
When using the 1-Gbit Intel Corporation I210 Gigabit Network Connection (rev 03) card present in the same model of server, the issue is gone and I get 950 Mbit/s.
So there seems to be something specific about the 82599ES / ixgbe that causes the issue.
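To confirm which driver an interface is bound to (rather than inferring it from lsmod), something like this works (a sketch; eth0 is a placeholder):
# Driver name (ixgbe vs. igb/i40e), driver/firmware version and PCI bus address
ethtool -i eth0
# Cross-check via PCI: shows the "Kernel driver in use" per device
lspci -k | grep -A 3 Ethernet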
Edit 2: Intel Flow Director and trying the out-of-tree ixgbe
Googling intel disable flowdirector produces https://access.redhat.com/solutions/528603 mentioning Intel 82599. It explains:
Intel Flow Director is an Intel NIC and driver feature which provides intelligent and programmable direction of similar network traffic (i.e. a "flow") into specific receive queues.
By default, Flow Director operates in ATR (Application Targeted Receive) mode. This performs regular RSS-style hashing when previously-unseen traffic is received. However, when traffic is transmitted, that traffic's tuple (or "flow") is entered into the receive hash table. Future traffic received on the same tuple will be received on the core which transmitted it. The sending and receiving process can then be pinned to the same core as the receive queue for best CPU cache affinity.
Note that community research has shown that ATR can cause TCP Out-of-Order traffic when processes are migrated between CPUs. It is better to explicitly pin processes to CPUs when using ATR mode.
Flow Director is mentioned in the Dropbox article, and so is ATR.
The mentioned "community research" is the same paper "Why Does Flow Director Cause Packet Reordering?" that Dropbox refers to.
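One experiment that follows from the "pin processes to CPUs" advice in the quote would be to pin the receiving iperf3 process to a single core, so that ATR steers the flow to one queue instead of chasing a migrating process (a sketch; core 2 is an arbitrary choice):
# Pin the iperf3 receiver to core 2 so ATR keeps directing the flow to the
# queue/interrupt associated with that core (best-case CPU cache affinity)
taskset -c 2 iperf3 -s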
Doing the suggested ethtool -K net0 ntuple on improves the speed from 20 Mbit/s to 130 Mbit/s (with the default ethtool -L net0 combined 16).
Running it for longer (iperf3 --time 30) makes it drop to 80 Mbit/s after 16 seconds.
Using ntuple on together with combined 16 does not improve it further.
So this is not a complete solution.
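To see whether Flow Director is involved at all during such a run, the ixgbe driver exposes counters in the NIC statistics (a sketch; exact counter names can vary by driver version):
# Flow Director match/miss/overflow counters
ethtool -S net0 | grep -i fdir
# Per-queue packet counters show which RX queue the iperf3 flow lands on
ethtool -S net0 | grep -i rx_queue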
Testing the ixgbe module option FdirMode=0 approach next.
On ram256g-1:
rmmod ixgbe; modprobe ixgbe FdirMode=0; sleep 2; ifconfig net0 94.130.221.7/26 ; ip route add 192.0.2.2 dev net0 proto static scope link ; ip route add default via 192.0.2.2 dev net0 proto static ; echo done
dmesg shows:
ixgbe: unknown parameter 'FdirMode' ignored
That is despite https://www.kernel.org/doc/Documentation/networking/ixgbe.txt documenting it:
FdirMode
--------
Valid Range: 0-2 (0=off, 1=ATR, 2=Perfect filter mode)
Default Value: 1
Flow Director filtering modes.
So 0=off seems even more desirable than the other two, which supposedly is what ntuple on/off switches between.
https://access.redhat.com/solutions/330503 says:
Intel choose to expose some configurations as a module parameter in their SourceForge driver, however the upstream Linux kernel has a policy of not exposing a feature as a module option when it can be configured in ways already available, so you'll only see some module parameters on Intel drivers outside the upstream Linux kernel tree.
Red Hat follow upstream kernel methods, so those options won't be in the RHEL version of the driver, but the same thing can often be done with ethtool (and without a module reload).
This suggests that 0=off is not actually achievable.
Or maybe it will work with modprobe.d options but not the modprobe command?
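For completeness, the modprobe.d variant of that idea would look roughly like this (a sketch; whether the in-tree ixgbe accepts FdirMode at all is exactly what is in question):
# Persist the module option (only useful if the driver actually supports it)
echo 'options ixgbe FdirMode=0' > /etc/modprobe.d/ixgbe.conf
# Reload the driver and check whether the parameter was accepted or ignored
rmmod ixgbe && modprobe ixgbe
dmesg | grep -i fdirmode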
Relevant code:
- Old kernel with the FdirMode option:
- New kernel without:
  - https://github.com/torvalds/linux/blob/b86545e02e8c22fb89218f29d381fa8e8b91d815/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c#L648
    - Suggests that Flow Director is only enabled if the RSS queue length is > 1.
    - So probably setting the queue length to 1 with ethtool -L (which is --set-channels) should already achieve it.
But it seems https://github.com/intel/ethernet-linux-ixgbe is still actively developed and supports all the old options.
It also supports FdirPballoc, which never existed in torvalds/linux. That is described in: https://forum.proxmox.com/threads/pve-kernel-4-10-17-1-wrong-ixgbe-driver.35868/#post-175787
Also related: https://www.phoronix.com/news/Intel-IGB-IXGBE-Firmware-Update
Maybe I should try to build and load that?
From that driver, FdirMode was also removed:
- https://github.com/intel/ethernet-linux-ixgbe/commit/a9a37a529704c584838169b4cc1f877a38442d36
- https://github.com/intel/ethernet-linux-ixgbe/commit/a72af2b2247c8f6bb599d30e1763ff88a1a0a57a
From https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20160919/006629.html:
ethtool -K ethX ntuple on
This will enable "perfect filter" mode, but there are no filters yet, so the received packets will fall back to RSS.
Tried:
ethtool -K net0 ntuple on
ethtool --config-ntuple net0 flow-type tcp4 src-ip 192.0.2.1 action 1
This did not improve the speed.
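To check that the rule was actually installed (and which queue it targets), the filter list can be dumped (a sketch):
# List currently installed ntuple / perfect filters and their target queues
ethtool --show-ntuple net0
# Short form of the same command
ethtool -u net0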
I also found that the speed on Linux 6.3.1 seems to be 90 Mbit/s, while it's 25 Mbit/s on 6.11.3.
Compiling the out-of-tree ethernet-linux-ixgbe on the Hetzner Rescue System Linux (old), which has 6.3.1 (there is no release yet for Linux 6.11):
wget https://github.com/intel/ethernet-linux-ixgbe/releases/download/v5.21.5/ixgbe-5.21.5.tar.gz
tar xaf *.tar.gz
cd ixgbe-*/src && make -j
# because disconnecting the ethernet below will hang all commands
# from the Rescue Mode's NFS mount if not already loaded into RAM
ethtool --help
timeout 1 iperf3 -s
dmesg | grep -i fdir
modinfo /root/ixgbe-*/src/ixgbe.ko # shows all desired options
rmmod ixgbe; insmod /root/ixgbe-*/src/ixgbe.ko; sleep 2; ifconfig eth1 94.130.221.7/26 ; ip route add 192.0.2.2 dev eth1 scope link ; ip route add default via 192.0.2.2 dev eth1 ; echo done
iperf3 -s
This driver provides a more solid 450 Mbit/s out of the box.
rmmod ixgbe; insmod /root/ixgbe-*/src/ixgbe.ko FdirPballoc=3; sleep 2; ifconfig eth1 94.130.221.7/26 ; ip route add 192.0.2.2 dev eth1 scope link ; ip route add default via 192.0.2.2 dev eth1 ; echo done
dmesg | grep -i fdir
iperf3 -s
Brings no improvement.
Also trying AtrSampleRate, whose documentation says:
A value of 0 indicates that ATR should be disabled and no samples will be taken.
rmmod ixgbe; insmod /root/ixgbe-*/src/ixgbe.ko AtrSampleRate=0; sleep 2; ifconfig eth1 94.130.221.7/26 ; ip route add 192.0.2.2 dev eth1 scope link ; ip route add default via 192.0.2.2 dev eth1 ; echo done
dmesg | grep -i atrsample
iperf3 -s
Brings no improvement.
ethtool -L net0 combined 1
brings no improvement here either, and
ethtool -K eth1 ntuple on
ethtool -L eth1 combined 12 # needed, otherwise the out-of-tree ixgbe driver complains with `rmgr: Cannot insert RX class rule: Invalid argument` when `combined 1` is set
ethtool --config-ntuple eth1 flow-type tcp4 src-ip 192.0.2.1 action 1
brings no improvement either.
Edit 3: Changed NIC
I changed the NIC of the Linux server from Intel 82599ES to Intel X710, which uses the Linux i40e driver.
The problem persisted.
I suspect it is because the X710, too, supports Intel Flow Director.
The partial mitigation of ethtool -L eth0 combined 1 has the same effect as for the 82599ES.
The command ethtool --set-priv-flags eth0 flow-director-atr off (which is possible for i40e but not ixgbe), mentioned by Dropbox as the workaround, only achieved the same speedup as ethtool -L eth0 combined 1 (so to around 400 Mbit/s).
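For reference, the available private flags (and whether flow-director-atr appears among them) can be listed per interface (a sketch; eth0 is a placeholder):
# i40e exposes flow-director-atr as a private flag; ixgbe does not
ethtool --show-priv-flags eth0
# The Dropbox workaround tried above:
ethtool --set-priv-flags eth0 flow-director-atr off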
Interestingly, Hetzner reported that the Hetzner Cloud machines are also powered by the Intel X710, but they don't exhibit the problem.
It seems I found a solution to get full speed (caveats below): reducing the RSS queues to 1 together with disabling SACK support in the Linux kernel:
ethtool -L eth0 combined 1
sysctl -w net.ipv4.tcp_sack=0
I had tried the first line before (it brought a moderate improvement), but disabling SACKs with the second line fully fixes the problem for me, producing full gigabit upload speed from the Windows machine. Neither line alone produces the full speed; combined they fix it.
This fixes the problem for both our Intel X710 and the Intel 82599ES servers.
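To double-check that SACK is really off for new connections while an upload runs, the per-connection TCP options can be inspected on the Linux receiver (a sketch; with tcp_sack=0 the "sack" flag should no longer show up in the ss output):
# Current sysctl value
sysctl net.ipv4.tcp_sack
# TCP internals (negotiated options, cwnd, retransmits) of the iperf3 connection
ss -tni 'sport = :5201'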
Caveat:
I believe a side effect of disabling SACKs is that if there is packet loss, the connection speed temporarily drops down more than with SACKs enabled. However, it does recover within a few seconds, and on average I observed successful full gigabit speed uploads from Windows Server 2019 with these settings. I will test it more in the Windows-10-over-the-Internet situation soon, as I have only tested Windows-Server-2019-HEL1 -> Linux-FSN1 so far.
Edit: I have now confirmed that sysctl -w net.ipv4.tcp_sack=0 is detrimental on long-range WAN connections that have packet loss. As soon as some packet loss happens, it makes the connection speed drop hard.
So doing the full-speed approach with both settings is only recommended on connections where you know packet loss is low. If you're receiving from the general Internet, it is better to only use ethtool -L eth0 combined 1 and keep sysctl -w net.ipv4.tcp_sack=1. That will be more reliable over the Internet.
It makes some sense that disabling SACKs has an effect, given that the Dropbox article (and the quoted Microsoft statement) mentions SACKs as well.
I do not yet understand what causes my observation that the Hetzner Cloud machines don't exhibit the problem. Setting net.ipv4.tcp_sack on their VM hosts should have no effect, because the Linux guest should be in control of TCP, not the host machine; the guest does have the tcp_sack=1 default and yet it works. Maybe there are more NIC settings that can achieve the same effect as disabling SACKs in the kernel that I haven't discovered yet, or maybe devices along the network path of dedicated servers cannot handle SACKs well, and those devices don't exist along the network path of Hetzner Cloud machines.
This article mentions network gear causing problems with SACK, but it mostly describes complete stalls of the connection as opposed to slowdowns: When to turn TCP SACK off?
I also observed that Windows 11 (I tested 23H2) has this problem fixed. According to the Microsoft article in the question, Windows Server 2022 also has it fixed.