We are running into an issue where the Cluster Shared Volume (CSV) attached to our Hyper-V 2012 R2 cluster is dropped or faulted at the slightest interruption of the iSCSI SAN connection it uses. This is, of course, a problem, as it causes all the VMs to crash or shut down.
The interruptions in the iSCSI SAN connection happen when the primary SAN node fails over to its replica. There are about 10-15 seconds of downtime before the secondary picks up. We are using a FreeBSD + ZFS based solution in conjunction with HAST + CARP to provide high availability storage.
The failover works when a non-clustered iSCSI LUN is mounted on the Windows side, i.e. a normal connection through the initiator. The I/O simply pauses until the connection is reestablished. I expected the same behavior with the CSV, but alas, it seems to be very picky about I/O timeouts.
Is there a way to lengthen the CSV timeout, or perhaps some other fix for this issue?
I have seen this happen to a lot of people.
Most of the time the issue is one of these:
Backup traffic isn't perfectly isolated from the cluster management traffic. Ethernet being what it is, the increase in packet collisions dramatically reduces bandwidth and increases round-trip time for the heartbeat. And then boom! The CSV is down.
Another common issue is that the overall Ethernet speed is too low for the overall load. When the backup starts, you get a huge spike in traffic for all kinds of reasons.
To my knowledge there isn't a way to prolong the timeout. CSV is indeed extremely picky about the heartbeat timeout. After encountering this issue at a couple of sites, we set the I/O speed limit in BackupChain to reduce the risk of it occurring. However, from what I have seen so far, the real solution is to avoid these connection gaps in the first place.
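If it helps to put a number on those gaps, here is a rough Python sketch (not something from our setup, just an illustration) that polls the iSCSI portal over TCP and reports the longest stretch during which it was unreachable. The address `192.0.2.10` is a placeholder for the CARP virtual IP, and `probe`, `watch` and `longest_gap` are names made up for this example. Port 3260 is the standard iSCSI port. Keep in mind that TCP reachability is only a proxy: the initiator may take longer to log back in than the port takes to answer.

```python
import socket
import time


def probe(host, port=3260, timeout=1.0):
    """Return True if a TCP connection to the iSCSI portal succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def longest_gap(samples):
    """Longest outage in seconds.

    samples is a time-ordered list of (timestamp_seconds, reachable).
    An outage runs from the first failed probe to the next successful one;
    an outage still open at the end is counted up to the last sample.
    """
    longest = 0.0
    gap_start = None
    for ts, ok in samples:
        if not ok and gap_start is None:
            gap_start = ts
        elif ok and gap_start is not None:
            longest = max(longest, ts - gap_start)
            gap_start = None
    if gap_start is not None:
        longest = max(longest, samples[-1][0] - gap_start)
    return longest


def watch(host, duration=60.0, interval=0.5):
    """Probe the portal every `interval` seconds for `duration` seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Hypothetical address: replace with the CARP virtual IP of the SAN.
    results = watch("192.0.2.10", duration=120.0)
    print(f"longest outage: {longest_gap(results):.1f} s")
```

Start it, trigger a failover on the SAN, and compare the reported gap with the 10-15 seconds mentioned in the question. Running it once while a backup is in progress and once on an idle network should also show whether the backup traffic is making things worse.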