We're running a number of VMs on a 6-node failover cluster of blades using Hyper-V.
We have an intermittent issue (every few days, at different times, with no fixed frequency) of VMs losing network connectivity. Console access to the VM suggests all is fine, and the underlying blade has normal connectivity. To resolve the problem we either have to restart the VM or, more usually, live migrate it to another blade, which restores connectivity, and then migrate it back to the original blade.
I've had 3 instances of this with a specific VM running on a particular blade; however, it has also happened once with a different VM running on a different blade. All VMs and blades have the same basic setup and are running Windows Server 2008 R2.
Any ideas where I should be looking to diagnose the possible causes of this problem? The event logs provide no help.
Edit:
I've checked that each blade is running the latest NIC drivers and all seem to be fine.
Something that is confusing me - a failover or restart of the VM resolves the issue. Whilst I need to work out the underlying issue that is causing the NICs to hang, I'm also concerned that the VM didn't fail over to another node, which would have resolved the outage for me. Is there a way to configure the cluster so that it can tell that the VM guest has lost connectivity and fail it over? As things stand, the cluster assumes that the VM is running happily, presumably because Hyper-V reports that everything is fine even though there is a problem.
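For what it's worth, the kind of check I have in mind could be sketched as an external watchdog. This is only a rough illustration in Python, not something the cluster provides: `ping_once`, `ConnectivityWatchdog` and the threshold of 3 are made-up names and values, and it assumes the guests answer ICMP echo requests. It only flags a guest after several consecutive failed probes, so a single dropped ping doesn't trigger anything.

```python
import platform
import subprocess
from collections import defaultdict


def ping_once(host, timeout_s=2):
    """Send one ICMP echo request; return True if the ping command exits with 0."""
    if platform.system() == "Windows":
        # -n: number of echo requests, -w: timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Assumes Linux-style ping: -c is the count, -W the timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # Caveat: Windows ping can exit with 0 on a "Destination host unreachable"
    # reply, so a stricter check may need to inspect the command output.
    return result.returncode == 0


class ConnectivityWatchdog:
    """Tracks consecutive probe failures per guest (hypothetical helper)."""

    def __init__(self, threshold=3, probe=ping_once):
        self.threshold = threshold        # consecutive failures before flagging
        self.probe = probe                # callable(host) -> bool
        self.failures = defaultdict(int)  # host -> current failure streak

    def check(self, host):
        """Probe once; return True only when the failure streak first hits the threshold."""
        if self.probe(host):
            self.failures[host] = 0
            return False
        self.failures[host] += 1
        return self.failures[host] == self.threshold
```

Something would call `check()` for each guest on a schedule and, when it returns True, raise an alert or kick off the live migration we currently do by hand. Returning True only once per failure streak is deliberate, so a guest that stays down doesn't get migrated over and over.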
Edit:
Thought I'd update this since the problem is still outstanding - less frequent, but still seemingly random as to which VM is affected. The latest checks confirmed that all VMs are running the same MPIO drivers and the same driver versions for the virtual NICs. Everything looks identical to some VMs that run on the same blade centre but outside of this cluster, and those VMs have never experienced any problems.
Could this be the answer to your problem: http://support.microsoft.com/kb/974909
Do you by chance have port security turned on for your switch ports? If so, make sure that the number of MAC addresses allowed per port is large enough. Also, what is your network configuration like on the parents? Are you teaming?
Not the ideal answer that I'd hoped for but in this case it worked for our set-up...
We took the affected VMs out of the cluster, removed the NICs and then re-created them. At the same time, each blade was pulled from the cluster and had all its drivers updated before being added back in.
The loss-of-connectivity problem did not recur during the next 6 weeks that I monitored them - a job change after that means I'm not sure if the problem is still resolved ;)!