I've spent way too much time on this.
We have an 8-node Microsoft Private Cloud hosted on a Cisco FlexPod (B200 blades, Nexus 5k, 6248 FIs, with two NetApp FAS2550 controllers for SAN), running UCS 2.2(5a) firmware.
All the hosts SAN boot and run Server 2012 R2 Datacenter. A CSV is mounted on each host and holds the VHDX files for our 70-odd virtual machines.
Recently, we moved to Visual Studio Online and commissioned a number of build servers (well, 3). Once the build is complete, the artifacts are published to our staging and testing environments, each consisting of a single virtual machine running Server 2012 R2 Standard. This publishing uses Robocopy to copy the artifacts to the C$ share of those virtual machines.
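For anyone wanting to reproduce the publishing step outside the build pipeline, here is a rough sketch of how such a Robocopy call could be wrapped. The flags, paths and function names are my own assumptions for illustration, not the exact command our build servers run. The one thing worth encoding is that Robocopy exit codes below 8 mean the copy had no failures, so a plain "non-zero is an error" check gives false alarms.

```python
import subprocess


def build_robocopy_command(source, dest, retries=2, wait_seconds=5):
    """Build a Robocopy argument list (flags are assumed, not our real ones)."""
    return [
        "robocopy", source, dest,
        "/E",                  # copy subdirectories, including empty ones
        f"/R:{retries}",       # number of retries on a failed copy
        f"/W:{wait_seconds}",  # seconds to wait between retries
        "/NP",                 # do not log per-file progress percentages
    ]


def robocopy_succeeded(exit_code):
    # Robocopy exit codes form a bit mask; values below 8 mean no copy failures.
    return 0 <= exit_code < 8


def publish(source, dest):
    """Run Robocopy (Windows only) and report whether the copy had no failures."""
    result = subprocess.run(build_robocopy_command(source, dest))
    return robocopy_succeeded(result.returncode)
```

A call would look something like `publish(r"D:\drop\build", r"\\staging-vm\C$\deploy")`, where both paths are made up.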
When that copy happens, we see the following:
- The virtual machine's GUI becomes unresponsive
- When connecting to the VM during this state, we are unable to log in (sometimes ctrl-alt-del has no effect, sometimes the login prompt is shown but typing doesn't show in the password box)
- If we were logged in before the CIFS/SMB traffic started, GUI elements keep running until you interact with them
- After a while, all virtual machines hosted on the same Hyper-V host start experiencing timeouts
- The VM is unresponsive to shutdown commands through Failover Cluster Manager, so we have to turn off the VM, which takes a little time but completes
- After rebooting the VM, it's fine again until you try and copy to it again
- Existing VMs (i.e. VMs commissioned a long time ago) are unaffected, it's only ones commissioned in the last month
To debug, I tried a manual copy (i.e. Windows copy and paste), which exhibits the same issue.
I've tried:
- Changing receive side scaling settings
- Disabling VMQ (even though we have Cisco VICs rather than Broadcom adapters), both on the host and on the VM's network adapter
- Restarting the entire cluster (rolling restart of hosts)
- Building a new VM without any Windows updates; it experiences the same issue
- Confirmed that we don't have any duplicate IP addresses
- There is no AV running on any of the hosts or guest VMs
- Because GUI items that are open before the issue starts keep running, I ran Resource Monitor and checked the disk utilization. When the issue starts, the disk IO drops to pretty much 0. Based on this (along with NetApp-specific monitoring tools, and the fact that VMs on all the other nodes keep running), I've eliminated the storage component as the culprit. See below the screenshot of when the copy started:
Note the drop in disk IO. Incidentally, the disk IO of all other VMs on the same Hyper-V host drops to 0 at the same time.
Out of frustration, this morning I created a Gen1 Virtual Machine, and commissioned it as I would any other Gen2. This for some unknown reason, works. If I copy to the C$ share of a Gen2 machine, it fails. If I copy from the exact same location, to the C$ share of this new Gen1 machine, there are no issues.
Update: I've also noted that copying from the Gen2 machines is fine. The issue only appears when copying TO them.
What could be causing this? What is the difference between Gen1 and Gen2? Could it be a UCS firmware issue?
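To pin down exactly when the transfer stalls in each direction, something like the sketch below could replace the copy-and-paste test. It is only a minimal illustration: the function name, chunk size and stall threshold are arbitrary choices of mine, and it has not been run against the affected VMs. It copies a file in chunks, times each read-plus-write, and records any chunk that takes longer than the threshold.

```python
import time


def timed_copy(src_path, dst_path, chunk_size=4 * 1024 * 1024, stall_seconds=5.0):
    """Copy src to dst in chunks and record the slow ones.

    Returns (total_bytes, stalls), where stalls is a list of
    (byte_offset, elapsed_seconds) for chunks at or over stall_seconds.
    """
    stalls = []
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            start = time.monotonic()
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            dst.flush()  # hand the data to the OS before stopping the clock
            elapsed = time.monotonic() - start
            if elapsed >= stall_seconds:
                stalls.append((offset, elapsed))
            offset += len(chunk)
    return offset, stalls
```

Running it once with a UNC path on a Gen2 VM as the destination (for example a made-up `\\staging-vm\C$\temp\test.bin`) and once with that path as the source should show whether the stalls only occur on writes to the VM, and at what offset they begin.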