We have set up a Windows Network Load Balancing (NLB) cluster of two hosts for our new staging environment. The cluster initially works fine, but eventually, after our automated deployment script stops and starts one of the hosts, the first host can no longer see the second host: if you open NLB Manager on Host1, Host2 is not visible. This doesn't happen if you open NLB Manager on Host2. Edit: Actually, sometimes Host2 cannot see Host1 either. When that happens, the cluster is completely unresponsive to requests.
Things we've noticed during the "bad state":
- The hosts can ping each other.
- RPC appears to work, because I can access the C$ share of one host from the other.
- If I try to manually add the missing Host2 to Host1, it says it already exists. I can click Cluster > Connect to Existing and specify Host2, which works, but only until I close NLB Manager and open it again.
- When the cluster is in the bad state, if I try to start Host2, it says "Converging" but never changes to "Converged".
Things we've tried that did not fix the problem:
- Removed the entire NLB configuration and recreated it from scratch.
- Removed and re-added the network adapter in Device Manager on one of the hosts.
- Switched from Multicast to Unicast.
- Rebuilt the second node's VM from scratch.
Restarting the servers seems to fix it temporarily, until it happens again.
Configuration:
- Both hosts are running Windows Server 2012 R2 with the latest updates as of 2015-09-21. Before the NLB setup, the second host was cloned from an image of the first host.
- Both hosts are running as VMware guests on the same VMware host. I'm not sure of the VMware version (that's up to our admins), but VMware Tools on the guest OSes reports version 9.4.
- Each host has a single Ethernet adapter with 2 IP addresses assigned: the host's dedicated IP and the cluster IP.
- Port rules: Multicast, ports 80 and 443 only, Load Equal, Affinity Single
I have seen this type of behavior with Multicast on a pair of switches where each physical host was connected to only one of the two switches. The default switch configuration stopped the NLB servers from talking to each other, and we had to apply a switch configuration setting to get them talking.
A quick check is to set them to Unicast first. If that works, then look at the switch configuration.
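If you do end up looking at the switch, a static ARP or MAC entry there typically needs the cluster's MAC address, which NLB derives from the cluster IP. Below is a minimal sketch in Python, assuming the documented NLB conventions: 02-BF plus the four IP octets for unicast, 03-BF plus the four octets for multicast, and 01-00-5E-7F plus the last two octets for IGMP multicast. The function name and the example IP 192.0.2.10 are made up for illustration; check the result against what NLB Manager shows in the cluster properties.

```python
import ipaddress

# Two-byte prefixes that NLB puts in front of the four cluster IP octets.
_PREFIXES = {
    "unicast": (0x02, 0xBF),
    "multicast": (0x03, 0xBF),
}


def nlb_cluster_mac(cluster_ip: str, mode: str) -> str:
    """Return the cluster MAC that NLB derives from an IPv4 cluster IP.

    Hypothetical helper: mode is "unicast", "multicast" or "igmp".
    """
    octets = ipaddress.IPv4Address(cluster_ip).packed  # the 4 raw IP bytes
    mode = mode.lower()
    if mode in _PREFIXES:
        mac = bytes(_PREFIXES[mode]) + octets
    elif mode == "igmp":
        # IGMP multicast uses a fixed 01-00-5E-7F prefix and only the
        # last two octets of the cluster IP.
        mac = bytes((0x01, 0x00, 0x5E, 0x7F)) + octets[2:]
    else:
        raise ValueError(f"unknown NLB mode: {mode!r}")
    return "-".join(f"{b:02X}" for b in mac)


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder; substitute your own cluster IP.
    for m in ("unicast", "multicast", "igmp"):
        print(m, nlb_cluster_mac("192.0.2.10", m))
```

Comparing this address with what the switch has learned, or was statically configured with, is one way to tell whether the switch is dropping or misrouting traffic for the cluster.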