On a Windows 2008 R2 SP1 cluster running Hyper-V, a lost network connectivity on the primary host interface. The interface was rapidly flapping up and down, and this was later determined to be caused by a faulty switch port.
As this was a clustered server, the host interface was not fault tolerant (seeing as how the whole server was fault tolerant), so connectivity to the host was going up and down.
The Hyper-V guests were completely unaffected by the network outage as they used a dedicated trunk on the server separate from the host interface. Additionally, dedicated interfaces for the cluster and live migration networks were fine.
In order to diagnose the server, I tried to move all resources (Hyper-V Guests) to other nodes through Failover Cluster Manager. These moves failed with an error RPC Server Unavailable
.
The only way to move resources was by shutting down the guests, stopping the cluster service on the Node A, allowing other nodes to take ownership of the resources, and restarting the guests.
A few other notes:
- All nodes have Client for MS Networks and File & Printer Sharing enabled on the Cluster and LM networks.
- Node A was accessible over cluster and LM networks from other nodes (these are private, cluster-only networks); pingable, CIFs, etc.
- Accessing \\NODEA is done over the Host adapters, as you would expect in this case and is the reason for the
RPC Server Unavailable
error with that adapter being down.
My questions here are -
- Is there a way to still use Live Migration in a failure scenario such as this to prevent shutting down the Hyper-V guests?
- How can the network be reconfigured in the future so that the cluster service attempts to use the cluster and/or live migration networks to issue the RPC requests?