I have delays when sending data through a TCP channel I am not able to understand. The link is a 1Gb link with a end to end latency of roughly 40ms. In my current setup, latency (the time from one message to go from the sender user space to the receiver user space) can reach 100ms.
The sender socket is configured with the TCP_NODELAY option. Sender buffer (SO_SNDBUF) is configured to be 8MB. Receive buffer (SO_RCVBUF) is also configured to be 8MB. Tcp window scaling is activated.
update-1: I am using the zeromq 3.1.1 middleware to carry data. Socket configuration, including the TCP_NODELAY flag is performed by the middleware. Some options are accessible like rx and tx emit buffer sizes but not TCP_NODELAY. As far as I have understood, the TCP_NODELAY is activated to ensure that the data is sent as possible. In the meantime, actual socket sends and decision to send a message are performed in two separate threads. A proper batching is done if several messages are available at the time the first message in the batch is to be sent.
I ran a capture with tcpdump from which the frames below have been extracted. After the initial TCP handshake, the sender (172.17.152.124) starts sending data. The initial window size is 5840 bytes for the receiver & 5792 bytes for the sender.
My problem is that the sender sends two frames (#6 and #7) then stops, waiting for an ack to come back from the receiver. As far as I can see, the window size of the receiver is not reached and the transfer should not stop (384 bytes outstanding with an initial receive window size of 5840 bytes). I am starting to think that I have no understood correctly what TCP is. Can someone help clarifying ?
update-2: My data payload consists of a magic number followed by a timestamp. I have isolated the delayed packets by comparing the timestamps of the payloads with the timestamps put by tcpdump. The payload ts of frame #9 is very close to the one of frame #6 and #7 and clearly less than the timestamp of the received ack in frame #8.
update-1: The fact that frame #9 is not sent immediately can be explained by the slow-start of the TCP channel. In fact, the problem also appears once the connection is running for several minutes so the slow-start does not seem to be the general explanation.
20:53:26.017415 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [S], seq 2473022771, win 5840, options [mss 1460,sackOK,TS val 4219180820 ecr 0,nop,wscale 8], length 0
20:53:26.017423 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [S.], seq 2948065596, ack 2473022772, win 5792, options [mss 1460,sackOK,TS val 186598852 ecr 219180820,nop,wscale 9], length 0
20:53:26.091940 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 1, win 23, options [nop,nop,TS val 4219180894 ecr 186598852], length 0
20:53:26.091958 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [P.], seq 1:15, ack 1, w in 23, options [nop,nop,TS val 4219180895 ecr 186598852], length 14
20:53:26.091964 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], ack 15, win 12, options [nop,nop,TS val 186598927 ecr 4219180895], length 0
20:53:26.128298 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1:257, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 256
20:53:26.128519 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 257:385, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 128
20:53:26.202465 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 257, win 27, options [nop,nop,TS val 4219181005 ecr 186598963], length 0
20:53:26.202475 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], seq 385:1833, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 1448
20:53:26.202480 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1833:2305, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 472
If this matters, both ends are Linux RHEL5 boxes, with 2.6.18 kernels and network cards are using e1000e drivers.
update-3 Content of /etc/sysctl.conf
[jlafaye@localhost ~]$ cat /etc/sysctl.conf | grep -v "^#" | grep -v "^$"
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.ipv4.tcp_rmem = 65536 4194304 16777216
net.ipv4.tcp_wmem = 65536 4194304 16777216
net.core.netdev_max_backlog = 10000
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_mem = 262144 4194304 16777216
kernel.shmmax = 68719476736
After doing a little more digging into my traffic, I was able to see that my data was nothing but a sequence of small bursts with small idle periods between them.
With the useful tool
ss
, I was able to retrieve the current congestion window size of my connection (see thecwnd
value in the output):I ran the tool several times and discovered that the congestion window size was regularly reset to the initial value (10ms, on my Linux box). The connection was constantly looping back to the slow start phase. During the slow start period, bursts with a number of messages exceeding the window size were delayed, waiting for the acks related to the first packets of the burst.
The fact that the traffic consists of a sequence of bursts likely explains the reset of the congestion window size.
By deactivating the slow start mode after idle period, I was able to get rid of the delays.
This isn't going to be some subtle thing like a setting somewhere. This is going to be a problem with the protocol layered on top of TCP or a code bug. There's no magic "go faster" switch for TCP except for unusual cases like networks with very high latency or packet loss caused by noise.
The most obvious explanation would be if the code calls
write
orsend
with very small chunks. You need to accumulate at least 2KB per send, ideally 16KB. You say you batch the messages, but it's not clear what that means. Do you pass them in one call towrite
orsend
? Do you bundle them into a single protocol data unit for the protocol layered on top of TCP? Doing both of these things helps a lot with latency.Also, get rid of TCP_NODELAY. It can reduce throughput. It's only for applications that weren't designed to work with TCP or for applications that cannot predict which side will need to transmit next.
Unless, of course, you are in fact layering a protocol on top of TCP where you don't know which side is going to transmit next (like
telnet
, for example). Then it can make sense to set TCP_NODELAY. Significant expertise is required to make that kind of protocol work with low latency. If that's your situation, post more details about the protocol you're layering on top of TCP, what its protocol data unit sizes look like, and what determines which side transmits when.If you do in fact batch the messages available at one time and pass them in a single call to
write
orsend
, then most likely the problem is that the other side is not sending an application-layer acknowledgement for each batch. These improve latency by giving TCP ACKs packets to piggyback on. Your protocol should include them to ensure sides alternate which helps keep latency down.