I'm running netio (http://freshmeat.net/projects/netio/) on one machine (OpenSolaris) and contacting two different Linux machines (both on 2.6.18-128.el5), machine A and machine B. With netio, machine A gets a network throughput of 10MB/sec and machine B gets 100MB/sec. On the OpenSolaris machine I DTraced the connections, and the interactions look the same: same window sizes on receive and send, same ssthresh, same congestion window sizes. The difference is that the slow machine is sending an ACK for every 2 or 3 receives, whereas the fast machine is sending an ACK every 12 receives. All three machines are on the same switch. Here is the DTrace output.

Fast machine:
     delta   send     recd
      (us)  bytes    bytes    swnd  snd_ws    rwnd  rcv_ws    cwnd    ssthresh
       122   1448 \         195200       7  131768       2  128872  1073725440
        37   1448 \         195200       7  131768       2  128872  1073725440
        20   1448 \         195200       7  131768       2  128872  1073725440
        18   1448 \         195200       7  131768       2  128872  1073725440
        18   1448 \         195200       7  131768       2  128872  1073725440
        18   1448 \         195200       7  131768       2  128872  1073725440
        18   1448 \         195200       7  131768       2  128872  1073725440
        19   1448 \         195200       7  131768       2  128872  1073725440
        18   1448 \         195200       7  131768       2  128872  1073725440
        18   1448 \         195200       7  131768       2  128872  1073725440
        57   1448 \         195200       7  131768       2  128872  1073725440
       171   1448 \         195200       7  131768       2  128872  1073725440
        29    912 \         195200       7  131768       2  128872  1073725440
        30        /      0  195200       7  131768       2  128872  1073725440
Slow machine:

     delta   send     recd
      (us)  bytes    bytes    swnd  snd_ws    rwnd  rcv_ws    cwnd    ssthresh
       161        /      0  195200       7  131768       2  127424  1073725440
        52   1448 \         195200       7  131768       2  128872  1073725440
        33   1448 \         195200       7  131768       2  128872  1073725440
        11   1448 \         195200       7  131768       2  128872  1073725440
       143        /      0  195200       7  131768       2  128872  1073725440
        46   1448 \         195200       7  131768       2  130320  1073725440
        31   1448 \         195200       7  131768       2  130320  1073725440
        11   1448 \         195200       7  131768       2  130320  1073725440
       157        /      0  195200       7  131768       2  130320  1073725440
        46   1448 \         195200       7  131768       2  131768  1073725440
        18   1448 \         195200       7  131768       2  131768  1073725440
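As a rough cross-check of the ACK pattern, here is a minimal Python sketch (not part of the DTrace tooling) that counts how many data segments go out before each ACK comes back. It assumes the row layout shown above: `\` marks a send row and `/` marks a receive row. The function name `sends_per_ack` and the variable `SLOW_TRACE` are made up for illustration.

```python
def sends_per_ack(trace_text):
    """Return the number of send rows seen before each receive (ACK) row."""
    runs = []
    pending = 0  # sends seen since the last receive row
    for line in trace_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        if fields[1] == "/":
            # Receive row: "delta / bytes ..."
            if pending:
                runs.append(pending)
            pending = 0
        elif fields[2] == "\\":
            # Send row: "delta bytes \ ..."
            pending += 1
        # Anything else (for example header lines) is ignored.
    return runs


# Rows copied from the slow-machine output above.
SLOW_TRACE = r"""
161 / 0 195200 7 131768 2 127424 1073725440
52 1448 \ 195200 7 131768 2 128872 1073725440
33 1448 \ 195200 7 131768 2 128872 1073725440
11 1448 \ 195200 7 131768 2 128872 1073725440
143 / 0 195200 7 131768 2 128872 1073725440
46 1448 \ 195200 7 131768 2 130320 1073725440
31 1448 \ 195200 7 131768 2 130320 1073725440
11 1448 \ 195200 7 131768 2 130320 1073725440
157 / 0 195200 7 131768 2 130320 1073725440
46 1448 \ 195200 7 131768 2 131768 1073725440
18 1448 \ 195200 7 131768 2 131768 1073725440
"""

print(sends_per_ack(SLOW_TRACE))  # [3, 3]
```

Sends after the last receive row are not counted, since the ACK that closes that run is not in the sample.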
DTrace code:

    dtrace: 130717 drops on CPU 0

    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    #pragma D option defaultargs

    inline int TICKS=$1;
    inline string ADDR=$$2;

    dtrace:::BEGIN
    {
        TIMER = ( TICKS != NULL ) ? TICKS : 1 ;
        ticks = TIMER;
        TITLE = 10;
        title = 0;
        walltime=timestamp;
        printf("starting up ...\n");
    }

    tcp:::send
    / ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
    {
        nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
        delta= timestamp-walltime;
        walltime=timestamp;
        printf("%6d %8d \ %8s %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
            delta/1000,
            args[2]->ip_plength - args[4]->tcp_offset,
            "",
            args[3]->tcps_swnd,
            args[3]->tcps_snd_ws,
            args[3]->tcps_rwnd,
            args[3]->tcps_rcv_ws,
            args[3]->tcps_cwnd,
            args[3]->tcps_cwnd_ssthresh,
            args[3]->tcps_sack_fack,
            args[3]->tcps_sack_snxt,
            args[3]->tcps_rto,
            args[3]->tcps_mss,
            args[3]->tcps_retransmit
        );
        flag=0;
        title--;
    }

    tcp:::receive
    / ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
    {
        delta=timestamp-walltime;
        walltime=timestamp;
        printf("%6d %8s / %8d %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
            delta/1000,
            "",
            args[2]->ip_plength - args[4]->tcp_offset,
            args[3]->tcps_swnd,
            args[3]->tcps_snd_ws,
            args[3]->tcps_rwnd,
            args[3]->tcps_rcv_ws,
            args[3]->tcps_cwnd,
            args[3]->tcps_cwnd_ssthresh,
            args[3]->tcps_sack_fack,
            args[3]->tcps_sack_snxt,
            args[3]->tcps_rto,
            args[3]->tcps_mss,
            args[3]->tcps_retransmit
        );
        flag=0;
        title--;
    }
Follow-up: I added the number of unacknowledged bytes to the output, and it turns out the slow machine does run up its unacknowledged bytes until it hits the congestion window, whereas the fast machine never hits its congestion window. Here is the output from the slow machine when its unacknowledged bytes hit the congestion window:
       unack     unack  delta  bytes      bytes     send  receive     cong    ssthresh
       bytes     bytes   (us)   sent   received   window   window   window
        sent  received
      139760         0     31   1448 \            195200   131768   144800  1073725440
      139760         0     33   1448 \            195200   131768   144800  1073725440
      144104         0     29   1448 \            195200   131768   146248  1073725440
      145552         0     31        /        0   195200   131768   144800  1073725440
      145552         0     41   1448 \            195200   131768   147696  1073725440
      147000         0     30        /        0   195200   131768   144800  1073725440
      147000         0     22   1448 \            195200   131768    76744       72400
      147000         0     28        /        0   195200   131768    76744       72400
      147000         0     18   1448 \            195200   131768    76744       72400
      147000         0     26        /        0   195200   131768    76744       72400
      147000         0     17   1448 \            195200   131768    76744       72400
      147000         0     27        /        0   195200   131768    76744       72400
      147000         0     18   1448 \            195200   131768    76744       72400
      147000         0     56        /        0   195200   131768    76744       72400
      147000         0     22   1448 \            195200   131768    76744       72400
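To make the "unacknowledged bytes hit the congestion window" observation easier to check row by row, here is a minimal Python sketch. It assumes the column order of the output above: unacknowledged sent bytes come first, and the last two columns are the congestion window and ssthresh. The names `window_limited` and `SAMPLE` are hypothetical, and the simple `unacked >= cwnd` test is my own rough criterion, not something the TCP stack reports.

```python
def window_limited(trace_text):
    """Return (unacked, cwnd, limited) per data row.

    limited is True when the unacknowledged bytes have reached or
    passed the congestion window on that row.
    """
    rows = []
    for line in trace_text.splitlines():
        fields = line.split()
        # Skip blank lines and header lines (first field is not a number).
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        unacked = int(fields[0])   # unacknowledged bytes sent
        cwnd = int(fields[-2])     # congestion window (second to last column)
        rows.append((unacked, cwnd, unacked >= cwnd))
    return rows


# A few rows copied from the slow-machine output above.
SAMPLE = r"""
144104 0 29 1448 \ 195200 131768 146248 1073725440
145552 0 31 / 0 195200 131768 144800 1073725440
145552 0 41 1448 \ 195200 131768 147696 1073725440
147000 0 30 / 0 195200 131768 144800 1073725440
147000 0 22 1448 \ 195200 131768 76744 72400
"""

for unacked, cwnd, limited in window_limited(SAMPLE):
    print(unacked, cwnd, "LIMITED" if limited else "ok")
```

In this sample the last row is the point where cwnd drops to 76744 and ssthresh to 72400, which is half of 144800. That looks like the sender reacting to a congestion event, but I'm inferring that from the numbers rather than from anything the trace states directly.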
DTrace code:

    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    #pragma D option defaultargs

    inline int TICKS=$1;
    inline string ADDR=$$2;

    tcp:::send, tcp:::receive
    / ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
    {
        nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
        delta= timestamp-walltime;
        walltime=timestamp;
        printf("%6d %6d %6d %8d \ %8s %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
            args[3]->tcps_snxt - args[3]->tcps_suna ,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            delta/1000,
            args[2]->ip_plength - args[4]->tcp_offset,
            "",
            args[3]->tcps_swnd,
            args[3]->tcps_snd_ws,
            args[3]->tcps_rwnd,
            args[3]->tcps_rcv_ws,
            args[3]->tcps_cwnd,
            args[3]->tcps_cwnd_ssthresh,
            args[3]->tcps_sack_fack,
            args[3]->tcps_sack_snxt,
            args[3]->tcps_rto,
            args[3]->tcps_mss,
            args[3]->tcps_retransmit
        );
    }

    tcp:::receive
    / ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
    {
        delta=timestamp-walltime;
        walltime=timestamp;
        printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
            args[3]->tcps_snxt - args[3]->tcps_suna ,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            delta/1000,
            "",
            args[2]->ip_plength - args[4]->tcp_offset,
            args[3]->tcps_swnd,
            args[3]->tcps_snd_ws,
            args[3]->tcps_rwnd,
            args[3]->tcps_rcv_ws,
            args[3]->tcps_cwnd,
            args[3]->tcps_cwnd_ssthresh,
            args[3]->tcps_sack_fack,
            args[3]->tcps_sack_snxt,
            args[3]->tcps_rto,
            args[3]->tcps_mss,
            args[3]->tcps_retransmit
        );
    }
That still leaves the question of why one machine falls behind and the other doesn't ...
I have seen behavior like this before. I've seen two causes for it: TCP/IP flow-control problems, and drivers.
TCP/IP flow-control problems are less likely in your case since both machines are running the same kernel and (except for the device kernel modules if different) therefore running the same TCP/IP code.
Drivers though.
I had a Windows 2003 server a while back that simply couldn't transfer more than 6-10MB/s to certain servers, and as that was a backup-to-disk server this simply wasn't acceptable. I looked at some packet captures, and they looked a LOT like what you're seeing. What fixed it was updating the network drivers (Broadcom, as it happened) on the receiving server (the Server 2003 backup server) to something newer. Once that was done, I was getting 60-80MB/s.
Since this is Linux, you just might be running into a Large Segment Offload problem of some kind. This does rely in some part on the NIC hardware itself handling the splitting of large segments. If that is not working for some reason (bad firmware?) it can cause these kinds of odd delays. This is configured on a per-driver or interface basis.
ethtool -K can configure it by device.