I have a really strange one.
I have packet loss with excessive 'TCP Dup ACK' and 'TCP Fast Retransmission' entries when I download files (and only when downloading) from two different Windows 2008 servers. Upload speed is fine.
This ONLY occurs if the client computer (Win7) is connected at 100Mb/s. At 1Gb/s, there are no errors and I get full speed. If I set the client NIC to 100Mb/s, I get a lot of 'TCP Dup' errors and the download speed drops to around 2-5MB/s. Upload speed is 10MB/s or above.
This only happens to the Windows 2008 Server boxes (Dell, but different hardware). This problem does not occur if I transmit between the Win7 clients and the Linux servers.
It's like Server 2008 is unable to scale the TCP window properly, overloads the switch or something, then pauses traffic for a bit.
Parts of the network run at 100Mb/s due to older equipment, so this is really causing a problem in some buildings.
I have uploaded a pcap file from the client here. https://dl.dropboxusercontent.com/u/24907255/slow.pcap.gz
It shows a 50MB file being written to the server, then read back from the server with the errors.
Thanks for any help. I am stumped.
11/28/13 More Information.
I shut down the entire network so that only one client and one server are on the network. No change in the problem.
If I set every interface, server, client and Cisco 2960 switch to 100Mbs full, then the problem goes away. If I set the server and switch interface auto or 1Gbs, the problem is back.
If I bypass the switch with a Netgear 10/100 switch and set both client and server to auto, I have no problems.
I did discover this. In the normal setup, with server to switch at 1Gbs, I plug in the Netgear 10/100 switch between the client and Cisco switch, my speed problem is even worse. Speeds go from 5-7MB/s to 2-3MB/s, and yes I have tried fixed and auto network speeds. This would explain why some of the buildings that have a 2 switch hop between them and the main Cisco switch have more of a speed problem.
On to pinging. With everything at 1Gb/s, I can ping with a maximum-size payload (ping -l 65500) and it works. With the client at 100Mbs, the largest payload I can ping with is 17752 bytes. Any more and it fails, but only to the Windows servers; there is no problem with the Linux boxes. With the Netgear 10/100 between the server and client, no problems pinging at 65500.
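The 17752 figure lines up with IP fragmentation arithmetic. The sketch below assumes a standard 1500-byte MTU, an IPv4 header without options and an 8-byte ICMP echo header; the function name is only illustrative. Under those assumptions, 17752 bytes is exactly the largest payload that still fits in 12 full-size fragments, which may hint that something on the path copes with a burst of about 12 back-to-back frames and no more.

```python
MTU = 1500        # assumed standard Ethernet MTU (no jumbo frames)
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo request header


def fragments_for_ping(payload_bytes: int, mtu: int = MTU) -> int:
    """Number of IPv4 fragments needed to carry one echo request."""
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    total = payload_bytes + ICMP_HEADER
    return -(-total // per_fragment)  # integer ceiling division


for size in (17752, 17753, 65500):
    print(size, "bytes ->", fragments_for_ping(size), "fragments")
```

If one fragment of a datagram is dropped, the whole ping fails, so a hard cut-off at a fragment boundary is more suggestive of a burst or buffer limit than of random loss.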
Update 3
I swapped in a PowerConnect 2748 switch. Same problem with the server at 1Gbs and the client at 100Mbs. I can ping over 17752 now tho. Strange. So I don't think it's the Cisco switch.
Update 4. I am trying to get some hard numbers by using iperf. All systems are connected to the same switch, with the client set to 100Mbs and running the command iperf.exe -c -u -b 10m, so sending to the server. One server is 2008 with no load on it right now, the other is an Ubuntu box with a load avg of .20.
At 10m
- Linux jitter 0.022ms, packet loss is 0/8505
- Server 2008 jitter 1.859, packet loss 68/8505
Pushing it to 100m
- Linux jitter 0.445, packet loss 0/26634
- Server 2008 jitter 0.542, packet loss 94/26596
Now for stats sending TO the client at 10m
- Linux jitter 0.271 ms, 0/ 8500 (0%) 1 datagrams received out-of-order
- Server 2008 jitter .063, 20/8505 (0.24%)
Pushing it to 100m
- Linux jitter 0.230 ms 4083/85443 (4.8%), 1 datagrams received out-of-order, 95.7Mbs
- Server 2008 jitter 0.237, 28174/81718 (47%), 51.1mbs
So Server 2008 is poor in general, but you can see the huge packet loss (47%) when the connection is pushed to the client's 100Mbs limit.
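For comparing the runs, the loss percentages can be recomputed from the lost/total counts. This is a minimal sketch; the helper name and the labels are made up for illustration, and the counts are the ones quoted in the results. Note that the last pair of counts as posted (28174/81718) works out to roughly 34.5% rather than 47%, so one of those figures may have been mistyped.

```python
def loss_percent(report: str) -> float:
    """Convert an iperf-style 'lost/total' datagram count to a percentage."""
    lost, total = (int(part) for part in report.split("/"))
    return 100.0 * lost / total


# Counts copied from the results above (server sending TO the client).
to_client = {
    "Linux 10m": "0/8500",
    "Server 2008 10m": "20/8505",
    "Linux 100m": "4083/85443",
    "Server 2008 100m": "28174/81718",
}

for label, report in to_client.items():
    print(f"{label}: {loss_percent(report):.2f}% lost")
```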
Update 5.
When I tested with the PowerConnect 2748 switch, I used different cat5 cable between the server and switch and client and switch. This should rule out cabling or switch issues.
I have two Windows 2008 Servers in this environment, installed at different times, and on different hardware. The only thing they share is a Broadcom branded nic, but the chipset is different. Both experience the same problem, but I am doing my main testing on one so in case something goes wrong, the other will still work.
The one server has a built-in BCM5709C with two ports, and an add-on card (PCI Express, I think) with the same BCM5709C chipset and two ports. I have tried all of them and the problem still exists. So this should rule out any hardware problems.
Update 6 12/3/13 I installed the Intel nic. No change. I played around with the ctcp settings and no change there. I even turned off SMB2 and no difference.
I did some more testing at 100Mbs. Copying a 3GB ISO image TO the server (drag and drop) averages 10MB/s. Copying the same 3GB ISO image FROM the server averages 6.3MB/s.
With all network interfaces set to Auto and at 1Gbs, copying the ISO TO the server averages 101MB/s and copying the ISO FROM the server averages 57MB/s.
So read speeds from the server are almost half the write speeds.
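To put those copy speeds in terms of link capacity, here is a small sketch. It assumes 1 MB = 10^6 bytes and ignores protocol overhead, so the percentages are rough; the function name is illustrative.

```python
def utilisation(mbytes_per_s: float, link_mbits_per_s: float) -> float:
    """Fraction of the raw link rate used by a transfer (overhead ignored)."""
    return mbytes_per_s * 8 / link_mbits_per_s


# (label, measured MB/s, link speed in Mb/s) taken from the tests above
measurements = [
    ("100Mb/s write", 10.0, 100),
    ("100Mb/s read", 6.3, 100),
    ("1Gb/s write", 101.0, 1000),
    ("1Gb/s read", 57.0, 1000),
]

for label, rate, link in measurements:
    print(f"{label}: {utilisation(rate, link):.0%} of link rate")
```

By this measure the writes use about 80% of the link at both speeds, while the reads use roughly half of it.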
This sounds like a speed/duplex mismatch causing collisions and retransmits. Misconfiguration between the server and the other side could cause this. Another reason for the mismatch could be failing autonegotiation.
Make sure both ends of the connection are configured identically regarding speed and duplex.
I believe you should investigate if any of the NIC driver/Windows NDIS offload settings relate to your problem. I am most suspicious of the LSO (Large Send Offload) function as I've seen it totally wreck a service (Dell server w. Broadcom NIC) in a manner which defied all troubleshooting book definitions of anything.
The actual effect of LSO, when it disrupts rather than enhances, is that the LSO engine may pass larger data frames than the switch supports. This causes the switch to silently discard those frames. Needless to say, this causes performance degradation and packet loss. The failure can be immediate, but can also be intermittent, making it tremendously difficult to troubleshoot. This is described in detail here: Large Send Offload and Network Performance
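As a rough mental model of what the offload engine is supposed to do (not of any particular driver), the stack hands the NIC one large buffer and the NIC cuts it into MSS-sized segments before they hit the wire. The sketch below assumes an MSS of 1460 bytes (1500-byte MTU minus 20 bytes each for the IPv4 and TCP headers) and the function name is hypothetical. The failure described above corresponds to this step not being done correctly, so that something larger than the MTU leaves the NIC.

```python
from typing import List

MSS = 1460  # assumed: 1500-byte MTU minus IPv4 (20) and TCP (20) headers


def segment(buffer: bytes, mss: int = MSS) -> List[bytes]:
    """Split one large send buffer into wire-sized TCP payloads."""
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


large_send = bytes(64 * 1024)  # a 64 KiB buffer handed down in one go
segments = segment(large_send)
print(len(segments), "segments, last one", len(segments[-1]), "bytes")
```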
Disclaimer: this is just a best-effort set of thoughts on a possible angle on your problem. Implementing any one of the changes below will disrupt your network communication. The computer should be restarted after applying any of the settings. I copy/paste the most interesting settings for reference, but the links contain all the hardcore info and caveats. I most strongly recommend using the official docs as the basis for change and this post as a checklist at most.
Before proceeding with any of this, back up your registry key of:
One uncool reason is due to an official bug described below, which changes some unrelated values when certain settings are sent through the command line.
I freely admit that where settings are present in both the NIC driver GUI and in Windows itself, I never really got clarity on whether one has to disable them both in the GUI and through Windows CMD/Registry, or whether one suffices. The blogs I've read which presented an answer have been inconsistent with regard to some minor detail or other, so I was never sure. Nowadays I attempt the change everywhere I find the option for whichever setting I'm focusing on. The GUI options are not presented here, but are described in the official docs.
Also, different NIC drivers for the same card may present varying granularity in the advanced settings in the GUI.
Disabling Task Offloading
This registry setting disables task offloading as defined in Using Registry Values to Enable and Disable Connection Offloading.
If the above setting has any effect you could try going granular as specified in the link. There are quite a number of settings governing this so I won't paste them all in.
I'll supply the LSO ones though:
Disabling connection offloading
As defined in Using Registry Values to Enable and Disable Connection Offloading.
Disabling TCP Chimney, TOE and TSO
As specified in How to Disable TCP Chimney, TCPIP Offload Engine (TOE) or TCP Segmentation Offload (TSO) Note the Win2008 hotfix
and in Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008.
The hotfix describes the issue thus:
Again, one may therefore wish to back up the entire registry key before doing anything:
If you google your problem together with the offloading highlights from above, you'll find no end of posts, articles and blogs describing similar issues due to NIC offloading. But if it still doesn't work then I guess you can move on up the stack to try other things out, because it isn't due to a half-broken cable, NIC or switchport, right?
Always look at the networking device for clues. So, if Cisco, do a "show interfaces f0/11" or whatever it may be in your case. Retransmits can also be due to a bad Ethernet port/NIC/cable, for example due to crosstalk. "show int" on the switch should show you these error stats; if that's the case, the counters will be obviously way too high.
EDIT: As this is Microsoft, it's most likely that that's your problem, but other than that, in general, start at layer one (make sure physical cables are good) and work your way up the stack: layer 2 (speed/duplex/MAC address filtering), then layer 3 and up (IP/UDP/TCP firewalling), etc.
This can also be "advanced" NIC attributes, like power management ones or IRQ priority. Assuming you have the same version of drivers, go to:
Device Manager
-> Network Interfaces
-> Properties for the NIC
-> Advanced tab
Check and compare all values here.
Did you check that jumbo frames are off on your 100/1000 network?
UPD:
If jumbo frames are used, then all networking hardware in the broadcast domain should use them. That is impossible with legacy 100Mb devices.
I do not know exactly how Win2008 TCP works, but with jumbo frames enabled it may start scaling the transmission window by packet size (not by packet count as usual). Then you would observe a situation like the one described.
FYI: http://m.windowsitpro.com/windows/q-how-do-i-enable-jumbo-frames
UPD2:
I looked at the packet dump you supplied and saw a lot of packets with length > 1500 and bad checksums (checksums for lengths < 1500 are OK). It confirms my assumption.
The only thing I cannot understand is that they belong to the first session, from client to server (!!!???):
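For anyone who wants to repeat that check without Wireshark, here is a minimal sketch that lists oversized frames in a classic pcap file using only the standard library. It assumes the original (non-pcapng, microsecond-timestamp) file format and an Ethernet capture, so 1514 bytes (1500 MTU plus 14-byte Ethernet header) is taken as the largest normal frame; the function name is illustrative. Keep in mind that frames larger than the MTU in a capture taken on the sending host are not proof of jumbo frames on the wire: with large send offload enabled, the capture may see the buffer before the NIC segments it.

```python
import struct

GLOBAL_HEADER_LEN = 24   # classic pcap file header
RECORD_HEADER_LEN = 16   # per-packet header: ts_sec, ts_usec, incl_len, orig_len


def oversized_frames(data: bytes, limit: int = 1514):
    """Return (packet_number, original_length) for frames longer than limit."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"          # file written little-endian
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"          # file written big-endian
    else:
        raise ValueError("not a classic microsecond pcap file")

    hits = []
    offset = GLOBAL_HEADER_LEN
    number = 0
    while offset + RECORD_HEADER_LEN <= len(data):
        _sec, _usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        number += 1
        if orig_len > limit:
            hits.append((number, orig_len))
        # Skip the record header plus the bytes actually stored for this packet.
        offset += RECORD_HEADER_LEN + incl_len
    return hits
```

With the capture from the question this could be called as `oversized_frames(gzip.open("slow.pcap.gz", "rb").read())`, after `import gzip`.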
The effects you describe in your later findings are in line with the way IEEE 802.3u operates:
If you hard set the speed of one of the interfaces (NIC/Switchport) and set the other to Auto, you will likely suffer a duplex mismatch.
If you hard set one of the interfaces to full duplex, the other cannot autonegotiate duplex but must also have it hard set.
Even if both interfaces are hard set to Auto/Full duplex, some NICs (or poorly written Windows drivers) still leave auto-negotiation operative and default to half duplex.
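The first two rules can be written down as a toy model. This is a sketch for 10/100 links only, under the assumption that a forced port sends no negotiation pulses, so an autonegotiating partner detects the speed by parallel detection and falls back to half duplex. The names and the `best` default are made up for illustration, and real drivers may deviate, as the third point says.

```python
from typing import Optional, Tuple

# None means "autonegotiate"; otherwise a forced (speed_mbps, duplex) pair.
Port = Optional[Tuple[int, str]]


def resolve(local: Port, peer: Port, best: Tuple[int, str] = (100, "full")):
    """Return the (speed, duplex) the local port ends up using."""
    if local is not None:
        return local                 # forced ports use exactly what was set
    if peer is None:
        return best                  # both negotiate: highest common mode
    # Peer is forced: speed is sensed, duplex cannot be and defaults to half.
    return (peer[0], "half")


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends of the link disagree on duplex."""
    return resolve(a, b)[1] != resolve(b, a)[1]


print(duplex_mismatch(None, (100, "full")))           # auto vs forced full
print(duplex_mismatch((100, "full"), (100, "full")))  # both forced alike
print(duplex_mismatch(None, None))                    # both auto
```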
This is where I got those facts:
Two documents from Cisco relate (amongst others) to the 2900 series switches and troubleshooting NIC to switchport connectivity issues. They include concrete troubleshooting steps, especially for the switch side but also for the NICs. As Cisco has a lead on practical network analysis including in-depth knowledge of fundamental preconditions (such as the auto-negotiation electrical protocol), it is quite likely that the PowerConnect has similar working conditions (developed against the same protocol standards). I will quote freely for completeness and shape it up a bit later, but I would urge you to skim them through:
Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues
Configuring and Troubleshooting Ethernet 10/100/1000Mb Half/Full Duplex Auto-Negotiation
Here I quote some of the really interesting stuff:
Autonegotiation Valid Configuration Table
Then follows an extremely useful table which I'll try to port here later without losing formatting. The table also includes 1Gbps speed combinations with similar interesting effects and comments. However, highlights include:
The table footnotes are most interesting:
Why Is It That the Speed and Duplex Cannot Be Hardcoded on Only One Link Partner?
The very last topic of the NIC Compatibility link carries a technical background to the effects described in the passages quoted above. The basis for this background are some key details of the operation of the auto negotiation protocol:
In addition I found bug reports to similar effect from Cisco, but they are very specific with regards to combinations of switch hardware/software, os version, nics and drivers. Without knowing exact details it gets too speculative.
I believe this may just be a confirmation of your findings, by way of protocol definition and mode of operation.
Solutions
So assuming this was not a wild (but fun) goose chase, I quote you:
1) "If I set every interface, server, client and Cisco 2960 switch to 100Mbs full, then the problem goes away. If I set the server and switch interface auto or 1Gbs, the problem is back."
2) "If I bypass the switch with a Netgear 10/100 switch and set both client and server to auto, I have no problems."
3) Try to find NIC/driver combinations compatible with the old switches. Purchase as necessary.
4) Use solid technical references and reasoning to motivate budget for upgrading switches where necessary.