Marcel

Asked: 2016-09-29 02:16:13 +0800 CST

CIFS/SMB causes virtual machine (and eventually Hyper-V host) to hang

I've spent way too much time on this.

We have an 8-node Microsoft Private Cloud hosted on a Cisco FlexPod (B200 blades, Nexus 5k, 6248 FIs, and two NetApp FAS2550 controllers for SAN), running UCS 2.2(5a) firmware.

All the hosts boot from SAN and run Server 2012 R2 Datacenter. A CSV is mounted on each host and holds the VHDX files of our 70-odd virtual machines.

Recently, we moved to Visual Studio Online and commissioned a number of build servers (well, 3). Once a build is complete, the artifacts are published to our staging and testing environments, each consisting of a single virtual machine running Server 2012 R2 Standard. This publishing uses Robocopy to copy the artifacts to the C$ share of those virtual machines.

When that copy happens, we see the following:

The virtual machine's GUI becomes unresponsive
When connecting to the VM during this state, we are unable to log in (sometimes ctrl-alt-del has no effect, sometimes the login prompt is shown but typing doesn't show in the password box)
If we were logged in before the CIFS/SMB traffic started, GUI elements keep running until you interact with them
After a while, all virtual machines hosted on the same Hyper-V host start experiencing timeouts
The VM is unresponsive to shutdown commands through Failover Cluster Manager, and we have to turn off the VM, which takes a little time but completes
After rebooting the VM, it's fine again until you try and copy to it again
Existing VMs (i.e. VMs commissioned a long time ago) are unaffected; only the ones commissioned in the last month show the problem

To debug, I tried a manual copy (i.e. Windows copy and paste), which exhibits the same issue.
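For anyone who wants to reproduce the copy in a scripted way and see exactly where it stalls, here is a minimal sketch. It assumes Python is available on the source machine; the function names `copy_with_timing` and `slowest_chunks` are made up for illustration and are not part of our build tooling. It copies a file in chunks, forces each chunk to be flushed, and records how long each chunk took.

```python
import os
import time


def copy_with_timing(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks.

    Returns a list of (total_bytes_written, seconds_for_this_chunk) tuples,
    one per chunk, so a stall shows up as an unusually long chunk time.
    """
    samples = []
    total = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            start = time.perf_counter()
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            fout.flush()
            # Ask the OS to push the data out now, so the time is
            # attributed to this chunk rather than hidden by buffering.
            os.fsync(fout.fileno())
            total += len(chunk)
            samples.append((total, time.perf_counter() - start))
    return samples


def slowest_chunks(samples, n=5):
    """Return the n samples with the longest per-chunk time."""
    return sorted(samples, key=lambda s: s[1], reverse=True)[:n]
```

The destination would be a UNC path on the affected VM's C$ share and the source a local build artifact; any paths passed in are placeholders. Comparing the output of `slowest_chunks` for an affected VM against an unaffected one should show how far into the transfer the hang begins.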

I've tried:

Changing receive side scaling settings
Disabling VMQ (even though we have Cisco VICs rather than Broadcom adapters), both on the host and on the VM's network adapter
Restarting the entire cluster (rolling restart of hosts)
Building a new VM without any Windows updates, which experiences the same issue
Confirming that we don't have any duplicate IP addresses
There is no AV running on any of the hosts or guest VMs
Because GUI items that are open before the issue starts keep running, I ran Resource Monitor and checked the disk utilization. When the issue starts, the disk IO drops to pretty much 0. Based on this (along with NetApp-specific monitoring tools, and the fact that VMs on all the other nodes keep running), I've eliminated the storage component as the culprit. See below the screenshot of when the copy started:

Note the drop in disk IO. Incidentally, the disk IO of all other VMs on the same Hyper-V host drops to 0 at the same time.

Out of frustration, this morning I created a Gen1 virtual machine and commissioned it as I would any Gen2 machine. For some unknown reason, this works. If I copy to the C$ share of a Gen2 machine, it fails. If I copy from the exact same location to the C$ share of this new Gen1 machine, there are no issues.

Update: I've also noted that copying from the Gen2 machines is fine. The issue only appears when copying TO them.

What could be causing this? What is the difference between Gen1 and Gen2? Could it be a UCS firmware issue?

0 Answers