I have a DAG set up between two Exchange servers at two different sites, connected via a dedicated gigabit circuit. For some reason, every so often the remote host gets removed from the active failover cluster membership (even though it's online), and then the two Exchange servers get into a fight over who's the primary. The event logs report that the "File Share Witness failed to arbitrate for the file share". The file share does exist, and the Exchange Trusted Subsystem does have admin rights on that box and on that share, but the DAG setup box reports that it does not.
Then all the users get prompted for a password, and sometimes the entire thing goes down for a few minutes. Even Public Folders get knocked offline sometimes.
It's reporting that there are 3 different subnets (an access subnet, the iSCSI subnet, and a non-routable IPv6 subnet), and I have replication disabled on all but the access subnet. The access subnet is also where the DAG has its IP addresses (I gave it 2).
Anyone run into this issue before?
Turns out this was a known issue with Exchange 2010; the Exchange Team Blog has more information.
Try raising the default heartbeat limits.
http://technet.microsoft.com/en-us/library/dd197562%28WS.10%29.aspx
I would suggest 25 seconds for local servers and 50 seconds for WAN connections. This should help. If you use storage arrays and VMware and perform storage scans, that will cause failovers. If you use jumbo frames on an iSCSI network that is seen by your MAPI network, that can cause them too. There are many possible causes, but I would start with the heartbeat limits.
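To put rough numbers on the heartbeat suggestion: as far as I know, the cluster decides a node is down after a number of consecutive missed heartbeats, so the tolerance is simply the heartbeat delay multiplied by the threshold (the SameSubnetDelay/SameSubnetThreshold and CrossSubnetDelay/CrossSubnetThreshold cluster properties). Here is a minimal Python sketch of that arithmetic. The function names are made up for illustration, and the 1000 ms / 5 heartbeat defaults are an assumption you should verify on your own cluster.

```python
import math


def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Seconds of missed heartbeats tolerated before a node is treated as down."""
    return delay_ms * threshold / 1000.0


def threshold_for_target(target_seconds: float, delay_ms: int) -> int:
    """Smallest threshold whose tolerance is at least target_seconds."""
    return math.ceil(target_seconds * 1000 / delay_ms)


# Assumed defaults (1000 ms delay, 5 missed heartbeats); check your own cluster.
ASSUMED_DEFAULT_DELAY_MS = 1000
ASSUMED_DEFAULT_THRESHOLD = 5

if __name__ == "__main__":
    # Tolerance with the assumed defaults: 1000 ms * 5 = 5.0 seconds.
    print(heartbeat_tolerance_seconds(ASSUMED_DEFAULT_DELAY_MS,
                                      ASSUMED_DEFAULT_THRESHOLD))

    # Thresholds needed to reach the 25 s (local) and 50 s (WAN) targets
    # suggested above, keeping the delay at 1000 ms.
    print(threshold_for_target(25, ASSUMED_DEFAULT_DELAY_MS))
    print(threshold_for_target(50, ASSUMED_DEFAULT_DELAY_MS))
```

The allowed ranges for the delay and threshold properties depend on the Windows Server version, so a given target may not be reachable with every combination; treat the results as a starting point and confirm them against the TechNet article linked above.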