We have a pretty nice piece of hardware set up to run multiple virtual machines in VMware, and one of the VMs is an instance of Windows Server 2003 running SQL Server 2005. For some reason we occasionally see 10-20 seconds of total packet loss to this machine, both from remote machines (such as my workstation) and from other VMs on the same physical hardware. I am using PingPlotter to keep a close eye on the packet loss.
So far we've turned off flow control on the NIC, but we are already running out of other things to try. What might be causing this and how can I identify the problem?
Note: We also have another server with a very similar configuration that shows the same type of problem to a lesser extent (because it's not used as heavily?)
Interesting. First, let's establish some specifics...
You have an ESX host that is running multiple VMs, right?
You have one of those VMs as a Windows 2003 server.
You say when you run pings from a "remote" machine to that VM, you see 10-20 seconds of packet loss.
OK, immediate questions:
1) Does the packet loss occur when pinging from one of the other VMs running on that host?
2) Do any of the other VMs on that host (or the host itself) display the same behavior when you ping them in an identical manner from an identical place on the network?
3) Are any of the other VMs running the same operating system as the VM displaying the behavior?
4) Is there any kind of timing pattern? Does it happen every 5 minutes? Is it every so many packets? Do you always lose the same number of packets?
5) When you go into the vSphere console, do you see any kind of performance graph changes that match the timing of your ping loss?
6) Is VMware Tools installed on the VM and up to date?
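To answer the timing question (4) with something firmer than eyeballing a graph, the ping results can be grouped into loss bursts and the spacing between bursts measured. Below is a minimal Python sketch of that idea. It assumes you have already turned your ping log into a list of (timestamp in seconds, reply received?) pairs; the function names and the sample format are made up for illustration and are not part of PingPlotter or any VMware tool.

```python
from typing import List, Tuple

# Hypothetical sample format: (seconds since start of capture, reply received?)
Sample = Tuple[float, bool]


def find_loss_bursts(samples: List[Sample]) -> List[Tuple[float, float, int]]:
    """Group consecutive lost pings into (first_lost, last_lost, lost_count)."""
    bursts = []
    start = None   # timestamp of the first lost ping in the current burst
    last = None    # timestamp of the most recent lost ping
    count = 0
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
                count = 0
            count += 1
            last = ts
        elif start is not None:
            # A reply arrived, so the current burst has ended.
            bursts.append((start, last, count))
            start = None
    if start is not None:
        # The capture ended while pings were still being lost.
        bursts.append((start, last, count))
    return bursts


def burst_intervals(bursts: List[Tuple[float, float, int]]) -> List[float]:
    """Seconds between the starts of consecutive bursts."""
    return [b[0] - a[0] for a, b in zip(bursts, bursts[1:])]


if __name__ == "__main__":
    # One ping per second for 20 seconds; pings at 3-5 and 13-14 are lost.
    lost = {3, 4, 5, 13, 14}
    samples = [(float(t), t not in lost) for t in range(20)]
    bursts = find_loss_bursts(samples)
    print(bursts)
    print(burst_intervals(bursts))
```

If the intervals come out roughly constant, that points toward something scheduled (a job, a backup, a snapshot); if they are irregular, load-related causes become more likely.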
Install/reinstall VMware Tools.
Check the load on the VMware server (CPU, interrupts, network traffic).
Check the host / hardware. You say you use VMware, but not whether it is VMware Server or ESX. Either way, it could be a hardware or related problem (driver version, etc.).
When I started using Hyper-V I got the same issue with some machines. It turned out to be a crappy driver plus broken TCP offloading (in the driver). Some of them are just really crappily implemented. I put in an Intel network card and things worked.
Take a look at your storage. High write queues can result in high latency which can have symptoms just like you described.
I had the exact same problems. They were solved by moving the offending VM to a different VMFS/storage.
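One way to test the storage theory before moving anything is to check whether the loss bursts line up with spikes in storage latency. The Python sketch below is only an illustration under assumptions: it presumes you have the bursts as (start, end) timestamps and latency readings as (timestamp, milliseconds) pairs exported from whatever performance monitoring you use, on the same clock. The function name, the 50 ms threshold and the 5 second slack are arbitrary placeholders, not recommended values.

```python
from typing import List, Tuple


def bursts_matching_latency(
    bursts: List[Tuple[float, float]],
    latency: List[Tuple[float, float]],
    threshold_ms: float,
    slack_s: float = 5.0,
) -> List[bool]:
    """For each (start, end) loss burst, report whether any latency sample
    taken within slack_s seconds of the burst exceeded threshold_ms."""
    results = []
    for start, end in bursts:
        hit = any(
            start - slack_s <= ts <= end + slack_s and ms > threshold_ms
            for ts, ms in latency
        )
        results.append(hit)
    return results


if __name__ == "__main__":
    # Hypothetical data: two loss bursts and three latency readings.
    bursts = [(100.0, 115.0), (300.0, 312.0)]
    latency = [(98.0, 250.0), (200.0, 5.0), (305.0, 8.0)]
    print(bursts_matching_latency(bursts, latency, threshold_ms=50.0))
```

If most bursts coincide with a latency spike, storage is a strong suspect; if none do, the network and driver suggestions above are probably a better use of time.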