We're reviewing Wireshark captures from a few client machines that show multiple duplicate ACK records, which then trigger retransmissions and out-of-sequence packets.
These are shown in the following screenshot. .26 is the client and .252 is the server.
What causes the duplicate ACK records?
More background if it helps:
We're investigating network throughput concerns at one particular client site. The perceived issue from a user interface perspective is that data is being transmitted slowly despite an underutilized 1 Gbps WAN connection.
Almost all of the client machines have the same issue; we have tested more than 20 machines. We did find two machines that do not have the problem, and we're in the process of identifying what is different in their configuration. On the two machines that do not have the problem, we only ever saw at most one duplicate ACK record. The machines that have the problem usually show three duplicate ACK records. One notable difference is that the machines that work fine all belong to members of the network operations team, and all of the other machines are for "regular" employees. The machines are supposed to be standard, but the network admins could have made changes on their local systems, which is another aspect we're researching.
We tried changing the TcpMaxDupAcks setting on the server but the value we really need is 5 and the valid range is only 1-3.
Server is Windows Server 2003. Clients are all enterprise managed Windows XP. All clients, including the two working ones, have Symantec anti-virus installed.
This is the only client site out of hundreds that has exhibited this problem.
pathping shows 56ms RTT and consistent 0/100 packet loss even from the problem machines.
Thanks,
Sam
Note: I'm assuming that this capture was taken on the client machine.
A brief summary on TCP sequencing: TCP reliably delivers streams of bytes between two applications. "Reliably" in this case means that, among other things, TCP guarantees to never deliver out of order data to a listening application.
In-order, reliable delivery is implemented through the use of sequence numbers. Every byte in each stream has a 32-bit sequence number, and each segment carries the sequence number of its first byte (remember that TCP is effectively two independent streams of data, A->B and B->A). If A sends an ACK to B, the value in the ACK field is the next sequence number A expects to see from B.
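To make the meaning of the ACK field concrete, here is a minimal sketch of how a receiver could derive its cumulative ACK from the segments it has seen. The function name and the (seq, length) tuple format are assumptions for illustration only, and 32-bit sequence number wraparound is deliberately ignored.

```python
def next_expected(initial_seq, segments):
    """Return the cumulative ACK value a receiver would send.

    initial_seq: first sequence number the receiver expects.
    segments:    (seq, length) tuples that have arrived, in any order.
    Wraparound of the 32-bit sequence space is ignored in this sketch.
    """
    ack = initial_seq
    for seq, length in sorted(segments):
        if seq > ack:
            break  # gap: everything after this point cannot be ACKed yet
        ack = max(ack, seq + length)
    return ack


# Bytes 1000-1199 arrived, 1200-1299 are missing, 1300-1399 arrived.
print(next_expected(1000, [(1000, 100), (1100, 100), (1300, 100)]))
```

The receiver keeps answering with 1200 until the missing bytes arrive, no matter how much later data it has already buffered. Those repeated, identical ACK values are what show up as duplicate ACKs.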
From the above, it appears that at least one TCP segment being sent from the server to the client was lost. The three duplicate ACKs in sequence are an attempt by the client to trigger a fast retransmit. When a TCP sender receives 3 duplicate acknowledgements for the same piece of data (i.e. 4 ACKs for the same segment, which is not the most recently sent piece of data), it can reasonably assume that the segment immediately after the segment being ACKed was lost in the network, and it retransmits that segment immediately instead of waiting for the retransmission timer to expire.
In this case, the re-transmission gets through, and is identified by Wireshark as out-of-order.
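The sender-side rule described above can be sketched roughly as follows. This is a simplification under stated assumptions: the function name is hypothetical, the input is just the sequence of ACK numbers the sender receives, and real stacks apply extra conditions (for example, the ACK carries no data) before counting an ACK as a duplicate.

```python
DUP_ACK_THRESHOLD = 3  # three duplicates, i.e. four identical ACKs in total


def fast_retransmit_points(acks, threshold=DUP_ACK_THRESHOLD):
    """Return the ACK numbers at which a sender would fast-retransmit."""
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                # Resend the segment starting at this sequence number.
                retransmits.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return retransmits


# One original ACK for 200 followed by three duplicates triggers a resend.
print(fast_retransmit_points([100, 200, 200, 200, 200, 300]))
```

With only one duplicate, as seen on the two machines that work fine, the threshold is never reached and no fast retransmit happens.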
As mentioned by joeqwerty, packet loss is most often caused by congestion. It may also be a result of CRC or other errors on a link, due to a bad interface card, loose cable, etc. I'd look at the stats of every link along the path to see if any are highly utilized and/or are experiencing large numbers of errors.
If you can't see any obvious candidates, perform concurrent packet captures at multiple points along the path to try and isolate where the loss is occurring.
What kind of WAN connection is in use here? Is it a dedicated line? MPLS VPN link? IPsec VPN over the public internet? Something else?
While you are isolating where the problem is, think of a packet dump as just one of the symptoms... As an analogy, if someone walks into the doctor's office with chest pains, the doc won't spend three hours investigating the nature of the pain. He spends about two minutes on that and then knows that 95% of the causes are either heartburn or angina... In the same way, if you see duplicate ACKs, don't rat-hole on the weeds of the trace right away.
After the connection establishes, slow TCP performance is not always because of transit network problems; sometimes it comes as the result of server CPU or disk limitations... and occasionally because of some issue on a client PC. I have chased my tail for weeks digging into the weeds of Wireshark traces only to give up and find the problem relatively quickly with mtr, or by looking at other host metrics such as CPU and disk I/O.
Your first task is to prove whether this is a network issue or a host-level issue. Focus on sending real traffic through your network and prove whether you're queuing / losing / re-ordering it (see Note 1); that is always the bottom line for a potential network issue like this.
I would do a ping sampling for an extended period of time (typically an hour for me) between the client and server while the throughput problem is happening; you can use mtr or the Ping Plotter freeware for this. If you're consistently losing packets at some hop, and all hops afterwards lose as much or more, then you have a potential network suspect. Keep in mind that device ICMP rate-limiting can make some hops appear to lose packets... that's why you want to look for a trend starting from that hop and continuing through those that follow.
Note 1: If you are re-ordering traffic, that will show up rather quickly in the expert info field that Wireshark provides.
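The hop-by-hop trend check described above can be written down as a small sketch. It assumes you have copied the per-hop loss percentages out of your sampling tool into a plain list, first hop first; the function name and the 1% floor are hypothetical choices, not part of any tool.

```python
def first_suspect_hop(loss_by_hop, min_loss=1.0):
    """Return the index of the first hop where loss starts and persists.

    loss_by_hop: loss percentage per hop, ordered from first hop to last.
    A hop is a suspect only if every later hop loses as much or more;
    isolated loss at one hop is treated as ICMP rate-limiting noise.
    Returns None when no hop fits that pattern.
    """
    for i, loss in enumerate(loss_by_hop):
        if loss < min_loss:
            continue
        if all(later >= loss for later in loss_by_hop[i:]):
            return i
    return None


# Hop 1 drops 40% of probes, but later hops do not: likely rate-limiting.
# Loss that starts at hop 2 and carries through to the end is the suspect.
print(first_suspect_hop([0, 40, 2, 3, 2]))
```

A result of None on a long sample is a reasonable hint to look at the hosts rather than the network.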
Seeing lots of [TCP segment of reassembled PDU] without ACKs, I'd say those ACKs are likely shown as [TCP Dup ACK ...] due to Selective Acknowledgement (SACK) behavior.
Example:
client sends data parts (...,0,1,2,3,4,5,6,...)
server acked (0), then received (2,4,3), then (5), then (6) and never got (1)
In the above scenario, the server can legitimately choose to ack the (2-4) range first, then the (2-5) range, then the (2-6) range. While forming the "(A-B) range ack" packet, the server has to specify the last-acked part (0) in the TCP header. Wireshark marks the range-acks (SACKs) as [TCP Dup ACK ...] because all those range-acks have the same last-acked value in the TCP header (Ack=872619 in your case).
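The example above can be replayed with a short sketch. It uses the same simplified model as the example (numbered parts and a "last-acked part" rather than real byte sequence numbers), and the function name is hypothetical.

```python
def ack_with_sack(received):
    """Return (last_in_order_part, sack_ranges) for a set of part numbers.

    last_in_order_part is -1 if part 0 has not arrived yet.
    sack_ranges lists contiguous (first, last) runs beyond the gap.
    """
    cumulative = -1
    while cumulative + 1 in received:
        cumulative += 1

    ranges = []
    for part in sorted(p for p in received if p > cumulative):
        if ranges and part == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], part)  # extend the current run
        else:
            ranges.append((part, part))  # start a new run
    return cumulative, ranges


# Part 1 never arrives, so the cumulative value stays at 0 each time.
received = {0}
for batch in ([2, 4, 3], [5], [6]):
    received.update(batch)
    print(ack_with_sack(received))
```

Each printed tuple has the same first element while the SACK range grows, which is why a tool that keys on the cumulative ACK value labels every one of them a duplicate.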
Duplicate ACKs in combination with slow network performance sound like a network congestion problem to me. Look at the volume and rate of broadcast traffic on the network. Make sure to look at physical layer and network layer broadcasts as well as multicasts.