I'm trying to get a better understanding of what is achieved by using RHEV/oVirt (or other OSS solutions) in an HA cluster. I'm interested in knowing how long it takes to fail over, and what exactly is happening when that happens, so I can judge whether or not this is an acceptable solution for different types of situations.
For example, what's the state of the system when it comes back? Is it exactly where it left off, or is it as if the power had been pulled and the system were restarting after a power outage (and thus possibly left with inconsistent disk state)?
I know this is a bit of a vague ask... but are there best practices for VMs to be run in an HA configuration like this, with the above considerations? To a layperson coming in with little to no experience, it seems like any application should be able to just be put on a VM and it'll magically work: if the primary VM host crashes, another VM host will take over. But it seems that's not really the case, and maybe there are some fundamental considerations that can be applied to most solutions.
The failover works using the classic clustering mechanism - a failure is detected (hypervisor unreachable), the hypervisor is fenced (multiple mechanisms and tiers supported), and the VMs that were marked HA get started on other hosts. The process should take about 2 minutes or less, depending on your settings and hardware.
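To make the order of those three steps concrete, here is a toy Python model of them. It is only a sketch under my own assumptions: the `Cluster` and `VM` classes, the `fail_over` function and the round-robin placement are all made up for illustration and have nothing to do with oVirt's actual engine code or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class VM:
    name: str
    highly_available: bool
    host: Optional[str] = None  # None means the VM is not running anywhere


@dataclass
class Cluster:
    hosts: Dict[str, bool]  # host name -> currently reachable?
    vms: List[VM] = field(default_factory=list)
    fenced: Set[str] = field(default_factory=set)

    def detect_failures(self) -> List[str]:
        # Step 1: a host counts as failed when it is unreachable
        # and has not already been dealt with.
        return [h for h, ok in self.hosts.items() if not ok and h not in self.fenced]

    def fence(self, host: str) -> None:
        # Step 2: stand-in for a real fencing action (e.g. a power cycle).
        # Here we only record that the host is guaranteed to be down.
        self.fenced.add(host)

    def restart_ha_vms(self, failed_host: str) -> List[str]:
        # Step 3: only safe after fencing, otherwise the same VM could end
        # up running twice against the same disk.
        if failed_host not in self.fenced:
            raise RuntimeError("refusing to restart VMs from an unfenced host")
        healthy = sorted(h for h, ok in self.hosts.items() if ok)
        restarted: List[str] = []
        for vm in self.vms:
            if vm.host != failed_host:
                continue
            if vm.highly_available and healthy:
                # Cold boot on another host: memory contents are gone.
                vm.host = healthy[len(restarted) % len(healthy)]
                restarted.append(vm.name)
            else:
                vm.host = None  # non-HA VMs simply stay down
        return restarted


def fail_over(cluster: Cluster) -> List[str]:
    restarted: List[str] = []
    for host in cluster.detect_failures():
        cluster.fence(host)
        restarted.extend(cluster.restart_ha_vms(host))
    return restarted
```

For example, with `host-a` unreachable and `host-b` healthy, calling `fail_over` moves only the VMs flagged `highly_available=True` from `host-a` to `host-b`; everything else that was on `host-a` stays down until someone starts it by hand.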
This works quite well in oVirt for disaster scenarios, but these VMs come back up as if from a power outage, so all in-flight data will of course be lost. If you care about state, the usual VM failover will not be enough; you need to implement active-active software on top of your hypervisors. Still, for MOST scenarios this is plenty, and the advantage of being able to turn any software stack into an HA stack by simply marking the VM it is deployed on as HA is pretty significant.
In short, basic VM HA is a nice feature, but if you really need to avoid any downtime and never lose memory state, you will need software that implements active/active clustering, sharding, or some other distributed design, or you can try to go completely stateless, so that a lost node will not matter. If you specify the actual software you'll be running, maybe we could advise on what to do with it.
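Since the VM comes back as if the power had been pulled, the application inside it should be written to survive exactly that. As one hedged illustration (general crash-safety practice, nothing oVirt-specific), here is a minimal Python sketch of a write that is either fully on disk or not visible at all after a hard restart. The helper name `durable_write` is my own, and the sketch assumes a POSIX guest whose flushes actually reach the underlying storage.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves either the old
    or the new content, never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final
    # rename stays on one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the file contents to disk
        os.replace(tmp_path, path)  # atomically swap in the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry as well, so the rename itself survives
    # a crash (POSIX only; opening a directory like this fails on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Anything that was only in memory when the host died is still lost; this just keeps what was already written in a consistent state when the VM boots on another host. Databases and journaling filesystems do the equivalent internally, which is a large part of why "restart the VM elsewhere" is good enough for most workloads.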