On a Windows 2008 R2 SP1 cluster running Hyper-V, I lost network connectivity on the primary host interface. The interface was rapidly flapping up and down, which was later traced to a faulty switch port.
Because this was a clustered server, the host interface had not been made fault tolerant (the reasoning being that the server as a whole was fault tolerant), so connectivity to the host went up and down with the interface.
The Hyper-V guests were completely unaffected by the network outage as they used a dedicated trunk on the server separate from the host interface. Additionally, dedicated interfaces for the cluster and live migration networks were fine.
In order to diagnose the server, I tried to move all resources (Hyper-V guests) to other nodes through Failover Cluster Manager. These moves failed with the error RPC Server Unavailable.
The only way to move resources was to shut down the guests, stop the cluster service on Node A, allow the other nodes to take ownership of the resources, and restart the guests.
A few other notes:
- All nodes have Client for MS Networks and File & Printer Sharing enabled on the Cluster and LM networks.
- Node A was accessible over the cluster and LM networks from the other nodes (these are private, cluster-only networks): pingable, CIFS, etc.
- Accessing \\NODEA is done over the host adapter, as you would expect in this case, which is why the RPC Server Unavailable error occurred while that adapter was down.
My questions here are -
- Is there a way to still use Live Migration in a failure scenario such as this to prevent shutting down the Hyper-V guests?
- How can the network be reconfigured in the future so that the cluster service attempts to use the cluster and/or live migration networks to issue the RPC requests?
Great question!
The most likely reason for the RPC failure is that the cluster name resource (and its IP address) were hosted on the node whose primary network connection was flapping.
Since the interface was going up and down, access to the cluster via the cluster name would likely fail due to the network interruptions.
You should be able to execute commands against the cluster from the command line (either cluster.exe or the FailoverClusters module in PowerShell). The FailoverClusters module can also be used over PowerShell remoting if the appropriate credential delegation is configured (either CredSSP or Kerberos).
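For example, a sketch of querying the cluster over remoting from a management station — NODEB and MYCLUSTER are placeholder names, and CredSSP delegation is assumed to already be enabled (via Enable-WSManCredSSP) on both ends:

```powershell
# NODEB / MYCLUSTER are hypothetical names; CredSSP must already be configured.
Invoke-Command -ComputerName NODEB -Authentication CredSSP -Credential (Get-Credential) -ScriptBlock {
    Import-Module FailoverClusters
    # Talk to the cluster directly via a surviving node, bypassing the cluster name.
    Get-ClusterNode -Cluster MYCLUSTER
    Get-ClusterGroup -Cluster MYCLUSTER
}
```

Running the commands on a surviving node avoids depending on the cluster name's DNS entry or its (flapping) IP address.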
In the case of a failure of the network interface hosting the cluster name, you could use PowerShell to move that cluster group to a node that is accessible, or simply execute commands against the cluster to migrate the virtual machines.
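A sketch of both actions — the group name "Cluster Group", the node name NODEB, and the guest name "VM1" are placeholders for your environment:

```powershell
Import-Module FailoverClusters

# Move the core cluster group (cluster name + IP) off the failing node.
Move-ClusterGroup -Name "Cluster Group" -Node NODEB

# Live-migrate a clustered guest to a healthy node
# (live migration is the default move type for this cmdlet).
Move-ClusterVirtualMachineRole -Name "VM1" -Node NODEB
```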
To ensure that this does not happen again, you will likely need to make the NIC highly available (NIC teaming). The right approach depends on where you manage the cluster from: one of the cluster nodes or a remote management station. If you are managing from a machine in that same cluster, you could add an IP address on the cluster network to the cluster name, but you would want to make sure that address is not added to DNS; otherwise remote management clients might try (and fail) to connect over the private network.
To add an IP address to the cluster group via PowerShell:
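A sketch, assuming the core group is named "Cluster Group", the network name resource is "Cluster Name", the existing IP resource is "Cluster IP Address", and 10.0.0.50/24 is a free address on the private cluster network — substitute your own names and addressing:

```powershell
Import-Module FailoverClusters

# Create a second IP Address resource in the core cluster group.
Add-ClusterResource -Name "Cluster IP (Private)" -ResourceType "IP Address" -Group "Cluster Group"

# Assign the private-network address and mask (placeholders).
$res = Get-ClusterResource -Name "Cluster IP (Private)"
$res | Set-ClusterParameter -Name Address -Value 10.0.0.50
$res | Set-ClusterParameter -Name SubnetMask -Value 255.255.255.0

# Let the network name come online over either address (OR dependency),
# so the cluster name survives the loss of the host network.
Set-ClusterResourceDependency -Resource "Cluster Name" `
    -Dependency "[Cluster IP Address] or [Cluster IP (Private)]"

Start-ClusterResource -Name "Cluster IP (Private)"
```

The OR dependency is what keeps the network name online when only one of the two IP resources can come up.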
You'll need to disable dynamic DNS registration and create static entries if you don't want remote management clients trying to talk to the private network.
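One way to do that from an elevated PowerShell prompt, run on each node — the adapter name "Cluster Network", the DNS server dns01, the zone contoso.com, and the addresses are all placeholders:

```powershell
# Stop the private adapter from registering itself in DNS and clear
# any DNS servers on it ("Cluster Network" is this node's adapter name).
netsh interface ipv4 set dnsservers name="Cluster Network" source=static address=none register=none

# Add a static A record pointing the host name at its public address instead
# (dns01, contoso.com, and 192.168.1.10 are placeholders).
dnscmd dns01 /RecordAdd contoso.com NODEA A 192.168.1.10
```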