I have delays when sending data through a TCP channel I am not able to understand. The link is a 1Gb link with a end to end latency of roughly 40ms. In my current setup, latency (the time from one message to go from the sender user space to the receiver user space) can reach 100ms.
The sender socket is configured with the TCP_NODELAY option. Sender buffer (SO_SNDBUF) is configured to be 8MB. Receive buffer (SO_RCVBUF) is also configured to be 8MB. Tcp window scaling is activated.
update-1: I am using the zeromq 3.1.1 middleware to carry data. Socket configuration, including the TCP_NODELAY flag is performed by the middleware. Some options are accessible like rx and tx emit buffer sizes but not TCP_NODELAY. As far as I have understood, the TCP_NODELAY is activated to ensure that the data is sent as possible. In the meantime, actual socket sends and decision to send a message are performed in two separate threads. A proper batching is done if several messages are available at the time the first message in the batch is to be sent.
I ran a capture with tcpdump from which the frames below have been extracted. After the initial TCP handshake, the sender (172.17.152.124) starts sending data. The initial window size is 5840 bytes for the receiver & 5792 bytes for the sender.
My problem is that the sender sends two frames (#6 and #7) then stops, waiting for an ack to come back from the receiver. As far as I can see, the window size of the receiver is not reached and the transfer should not stop (384 bytes outstanding with an initial receive window size of 5840 bytes). I am starting to think that I have no understood correctly what TCP is. Can someone help clarifying ?
update-2: My data payload consists of a magic number followed by a timestamp. I have isolated the delayed packets by comparing the timestamps of the payloads with the timestamps put by tcpdump. The payload ts of frame #9 is very close to the one of frame #6 and #7 and clearly less than the timestamp of the received ack in frame #8.
update-1: The fact that frame #9 is not sent immediately can be explained by the slow-start of the TCP channel. In fact, the problem also appears once the connection is running for several minutes so the slow-start does not seem to be the general explanation.
20:53:26.017415 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [S], seq 2473022771, win 5840, options [mss 1460,sackOK,TS val 4219180820 ecr 0,nop,wscale 8], length 0
20:53:26.017423 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [S.], seq 2948065596, ack 2473022772, win 5792, options [mss 1460,sackOK,TS val 186598852 ecr 219180820,nop,wscale 9], length 0
20:53:26.091940 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 1, win 23, options [nop,nop,TS val 4219180894 ecr 186598852], length 0
20:53:26.091958 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [P.], seq 1:15, ack 1, w in 23, options [nop,nop,TS val 4219180895 ecr 186598852], length 14
20:53:26.091964 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], ack 15, win 12, options [nop,nop,TS val 186598927 ecr 4219180895], length 0
20:53:26.128298 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1:257, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 256
20:53:26.128519 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 257:385, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 128
20:53:26.202465 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 257, win 27, options [nop,nop,TS val 4219181005 ecr 186598963], length 0
20:53:26.202475 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], seq 385:1833, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 1448
20:53:26.202480 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1833:2305, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 472
If this matters, both ends are Linux RHEL5 boxes, with 2.6.18 kernels and network cards are using e1000e drivers.
update-3 Content of /etc/sysctl.conf
[jlafaye@localhost ~]$ cat /etc/sysctl.conf | grep -v "^#" | grep -v "^$"
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.ipv4.tcp_rmem = 65536 4194304 16777216
net.ipv4.tcp_wmem = 65536 4194304 16777216
net.core.netdev_max_backlog = 10000
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_mem = 262144 4194304 16777216
kernel.shmmax = 68719476736