Looking for some help with a problem that has me at my wit's end - I've troubleshot this for days and cannot figure it out.
All of a sudden, a few days ago, copying files from some of our Windows XP SP3 x86 workstations to one particular Windows 2008 R1 x64 server has become super-slow - think 7 minutes to transfer a 10MB file. The problem is only workstation -> server; copying in the other direction works normally.
Copying files to this same server (and same file share) had been fine for months, and as far as I know nothing has been changed on the server, workstations, Group Policy, etc. The workstations are physical machines and the server is a VM running in ESX 3.5; everything is connected by gigabit LAN and all are joined to the same (Windows 2008 functional level) domain.
There is nothing obviously wrong on either the workstations or the server - no CPU/memory/disk issues or spikes, no event log entries, no apparent DNS or Active Directory issues, etc. Also, apart from this specific problem, the workstations and the server are behaving entirely normally (including network copy to other servers / shares).
Through some troubleshooting I've established that the problem is only occurring on some of our workstations - specifically, three machines that are used by our IT department. This does mean a slightly different Group Policy and application set, but as I mentioned above, nothing should have changed around the time the problem started, and none of these machines has anything unusual installed that should affect the network or file sharing.
Another unusual aspect to this problem is that it's occurred once before - involving exactly the same workstations and a different server, but in both cases the problem servers are almost identical - Windows 2008 x64 VM, running IIS7 as their /only/ app, being used as our development web server. Last time round we just nuked the server (and replaced it with the one that has the issue this time) which fixed the problem until now, but given the problem has repeated itself I want to get to the root of it.
Here's what I've tried so far, all to no avail:
- Rebooted :-)
- Disabled anti-virus and firewall.
- Turned off every service possible on the server.
- Re-installed VMware Tools on the server.
- Updated network drivers on the workstations.
- Used different user accounts - it's machine specific, not user specific.
- Created new shared folders / shares on the server.
- Used several different copying methods - Explorer, TeraCopy and xcopy.
- Mapped the share using IP, NetBIOS name and FQDN.
- Flushed DNS and ARP caches.
- Forced DNS re-registration.
- Fiddled with Network Card properties (link speed, Flow Control, TOE & TSO options, MTU, etc).
- Uninstalled IIS7 on the server (thinking this was the common denominator between the two servers we've had issues with).
- Probably some other stuff I've forgotten by now...
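For anyone who wants to repeat these tests in a consistent way, a small timing script makes the comparison between a "good" and a "bad" workstation less subjective. This is only a sketch: the name `timed_copy` is made up for illustration, and it just wraps Python's standard `shutil.copyfile`, so it measures the same kind of copy as Explorer or xcopy rather than anything lower-level.

```python
import os
import shutil
import time


def timed_copy(src, dst):
    """Copy src to dst and return (elapsed_seconds, bytes_per_second)."""
    size = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on tiny files or coarse timers.
    rate = size / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


# Example call (hypothetical paths, substitute a real test file and share):
#   seconds, rate = timed_copy(r"C:\temp\test10mb.bin",
#                              r"\\devserver\share\test10mb.bin")
#   print(f"{seconds:.1f} s, {rate / 1024:.1f} KiB/s")
```

Running the same call from an affected and an unaffected workstation, in both directions, gives numbers that can be compared directly after each change.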
I also tried capturing a network trace with Wireshark. I don't know much about analysing these, but I did compare the trace of a "normal" copy to the trace of a "super-slow" copy, and the main difference seems to be lots of fairly long pauses (typically ~0.3 seconds), each followed by a series of flagged entries beginning with things like "[TCP Retransmission]", "[TCP Dup ACK...", "[TCP Fast Retransmission]" and "[TCP Out-Of-Order]". Not sure if that's any help.
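A comparison like this can be made a bit more systematic by exporting each trace to CSV and counting the flagged packets and long pauses. The sketch below assumes the export has Wireshark's default "Time" and "Info" columns, with "Time" in seconds since the start of the capture; the function name `summarise_trace` and the 0.2 second threshold are arbitrary choices for illustration, not anything Wireshark defines.

```python
import csv
import io

# Info-column prefixes of the entries mentioned above.
FLAGS = (
    "[TCP Retransmission]",
    "[TCP Dup ACK",
    "[TCP Fast Retransmission]",
    "[TCP Out-Of-Order]",
)


def summarise_trace(csv_text, pause_threshold=0.2):
    """Return (counts per flag, list of gaps longer than pause_threshold)."""
    counts = {flag: 0 for flag in FLAGS}
    pauses = []
    previous_time = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = float(row["Time"])  # assumed: seconds since start of capture
        if previous_time is not None and t - previous_time > pause_threshold:
            pauses.append(round(t - previous_time, 3))
        previous_time = t
        for flag in FLAGS:
            if row["Info"].startswith(flag):
                counts[flag] += 1
    return counts, pauses
```

Feeding it the text of the "normal" and the "super-slow" export in turn gives two sets of counts to put side by side, which is easier to post than a raw capture.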
So - anyone have any bright ideas? I'm at a loss as to what could be wrong or how to fix it :-S
I would check the windowing on both servers. Server 2003 and XP have newer IP stacks and sometimes they fight against each other.
The retransmits and dupe ACKs mean packets are being lost or arriving out of order somewhere between the two machines, and each time that happens the sender backs off and feeds data more slowly.
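As a rough sanity check on whether the pauses in the trace could account for the whole slowdown, here is a back-of-envelope calculation using the figures from the question. It assumes (and this is only an assumption) that nearly all of the 7 minutes is spent sitting in ~0.3 second pauses and that each packet carries about 1460 bytes of payload, the usual figure for a 1500-byte MTU.

```python
FILE_BYTES = 10 * 1024 * 1024   # the 10MB test file
TRANSFER_SECONDS = 7 * 60       # observed copy time
PAUSE_SECONDS = 0.3             # typical pause seen in the trace
PAYLOAD_BYTES = 1460            # assumed TCP payload per packet (1500-byte MTU)

throughput = FILE_BYTES / TRANSFER_SECONDS     # bytes per second
pauses = TRANSFER_SECONDS / PAUSE_SECONDS      # upper bound on number of pauses
bytes_per_pause = FILE_BYTES / pauses
packets_per_pause = bytes_per_pause / PAYLOAD_BYTES

print(f"throughput: {throughput / 1024:.1f} KiB/s")
print(f"pauses: {pauses:.0f}, packets between pauses: {packets_per_pause:.1f}")
```

Under those assumptions the copy runs at roughly 24 KiB/s, with a stall about every five packets, so a handful of lost or reordered packets per burst would be enough to explain the 7 minutes on an otherwise healthy gigabit link.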
Try this site for some better explanation: link text
Well, we've managed to resolve this problem - except I'm not exactly sure how :-)
The problem has completely gone away, both on the server I posted about above, and also on the older development server I mentioned where we'd had this problem previously.
The only things that have changed on / affecting these servers are:
So, while I don't know what exactly fixed it, it's highly likely to be one of the above three items - hopefully that gives anyone else experiencing the problem some ideas to try.
Thanks for the responses all.