First, we have a two-node Windows Server 2008 R2 cluster running HA Hyper-V and DHCP. We use a back-end Dell MD3000i iSCSI SAN for storage. All of the networking is done via redundant switches and MPIO drivers. The data network is on a different VLAN than the primary network.
Here is the scenario we keep encountering:
We have occasional power outages. The dual UPS devices in the cabinet last for about 15 minutes, but if power isn't restored by then, everything goes down: cluster nodes, SAN and all.
Eventually the power comes back, and all of the devices are configured to boot when AC returns. However, after a complete outage like this, the cluster never comes back online properly. We get the usual errors, such as the quorum disk being unavailable. In addition, our two primary domain controllers are virtual machines running on the cluster. We also have a physical server running as another domain controller, thinking this would help when things come back online.
What we don't understand is why the system cannot recover by itself when it boots. There is an available DC for authentication, eventually, and the iSCSI network comes back online. Is there something else we are missing?
I think it may be related to the iSCSI Initiator service not having started by the time the cluster service is ready to go.
Any ideas or things I can post to help?
Thanks, Brent
We had the same problem with our cluster not coming back up cleanly after a power failure. Like yours, our shared storage is located on iSCSI SANs. The fix for us was to delay VM host and guest startup long enough that the SANs were back online FIRST. We found that if we didn't do this, the shared volumes would reconnect but remain in an offline state, thus causing the cluster to fail....
I ran into this problem on my own system. After a power failure the cluster just wouldn't come back up, either because the domain controller wasn't ready or because the SAN wasn't ready yet. For those who don't have managed PDUs or BIOS options to delay startup and need to add a boot delay, there's an easy method posted in this blog.
On Server 2008, open a command prompt and type:
This creates a second boot menu option (needed for the timeout to appear) and sets the timeout to 5 minutes (300 seconds). The server will sit at the boot menu until the timeout is reached or someone presses the enter key.
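The commands themselves did not survive in the post above. A likely reconstruction, assuming the blog used the built-in bcdedit tool from an elevated command prompt, is sketched below. The description text "Boot Delay" is only a placeholder, not something taken from the original post.

```bat
rem Assumption: run from an elevated command prompt on each cluster node.
rem Copy the current boot entry so the boot menu has a second option
rem (the menu, and therefore the timeout, only appears with two or more entries).
bcdedit /copy {current} /d "Boot Delay"

rem Set the boot menu timeout to 300 seconds (5 minutes).
bcdedit /timeout 300
```

Running bcdedit with no arguments afterwards lists the boot entries and the current timeout, which is a quick way to confirm the change took effect before relying on it during a real outage.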