I need to build a 2-node cluster(-like?) solution in active-passive mode, that is, one server is active while the other is passive (standby) and continuously gets the data replicated from the active one. KVM-based virtual machines would be running on the active node.
If the active node becomes unavailable for any reason, I would like to manually switch roles so that the second node becomes active (and the other passive).
I've seen this tutorial: https://www.alteeve.com/w/AN!Cluster_Tutorial_2#Technologies_We_Will_Use
However, I'm not brave enough to build something that complex and trust it to fail over fully automatically and operate correctly. There is too much risk of a split-brain situation, the complexity failing in some way, data corruption, etc., while my maximum-downtime requirement is not so severe as to require immediate automatic failover.
I'm having trouble finding information on how to build this kind of configuration. If you have done this, please share the info / HOWTO in an answer.
Or maybe it is possible to build highly reliable automatic failover with Linux nodes? The trouble with Linux high availability is that there seems to have been a surge of interest in the concept around 8 years ago, and many tutorials are quite old by now. This suggests there may have been substantial problems with HA in practice, and some/many sysadmins simply dropped it.
If that is possible, please share the info how to build it and your experiences with clusters running in production.
Why not use something that has been tested by thousands of users and has proven its reliability? You can just deploy the free Hyper-V Server with, for example, StarWind VSAN Free and get true HA without any issues. Check out this manual: https://www.starwindsoftware.com/resource-library/starwind-virtual-san-hyperconverged-2-node-scenario-with-hyper-v-server-2016
I have a very similar installation to the setup you described: a KVM server with a standby replica via DRBD in active/passive mode. To keep the system as simple as possible (and to avoid any automatic split-brain, e.g. due to my customer messing with the cluster network), I also ditched automatic cluster failover.
The system is 5+ years old and has never given me any problems. My volume setup is the following:
I wrote some shell scripts to help me in case of failover. You can find them here.
Please note that the system was architected for maximum performance, even at the expense of features as fast snapshots and file-based (rather than volume-based) virtual disks.
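The exact volume layout and scripts are not reproduced above, but as a rough, illustrative sketch only (the volume group, logical volume, and resource names below are assumptions), the kind of stack described (an LVM logical volume backing a DRBD device, with the DRBD device used directly as the guest's raw virtual disk) can be set up along these lines:

```
# On both nodes: create the backing logical volume (names are illustrative)
lvcreate -L 100G -n lv_vm1 vg_data

# With a matching resource defined in /etc/drbd.d/vm1.res on both nodes:
drbdadm create-md vm1          # initialize DRBD metadata (run on both nodes)
drbdadm up vm1                 # bring the resource up (run on both nodes)

# On the node chosen as initially active only:
drbdadm primary --force vm1    # first promotion, starts the initial sync

# The guest then uses the DRBD device (e.g. /dev/drbd0) directly as a raw
# disk in its libvirt definition, instead of a qcow2 file on a filesystem.
```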
Rebuilding a similar active/passive setup today, I would heavily lean toward using ZFS and continuous async replication via send/recv. It is not real-time, block-based replication, but it is more than sufficient for 90%+ of cases. If real-time replication is really needed, I would use DRBD on top of a ZVOL + XFS; in fact, I tested such a setup with automatic Pacemaker switchover in my lab with great satisfaction. If using third-party modules (as ZoL is) is not possible, I would use a DRBD resource on top of an lvmthin volume + XFS.
You can totally set up DRBD and use it in a purely manual fashion. The process should not be complex at all: you would simply do what a Pacemaker or Rgmanager cluster does, but by hand, essentially along the lines of the sketch below.
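As an illustration only (not a tested procedure), the manual switchover boils down to something like the following, assuming a single resource named vm1 and a libvirt-managed guest of the same name:

```
# On the old active node, if it is still reachable: release everything cleanly
virsh shutdown vm1            # stop the guest
umount /srv/vm1               # only if a filesystem sits on top of DRBD
drbdadm secondary vm1         # demote the DRBD resource

# On the node that should become active:
drbdadm primary vm1           # promote the DRBD resource
mount /dev/drbd0 /srv/vm1     # again, only if a filesystem is used
virsh start vm1               # start the guest from its local definition
```

If the old node is truly dead, the demotion step is skipped, and you must be certain it stays down before promoting the survivor; that judgment call is exactly what replaces automatic fencing in a manual setup.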
Naturally, this will require that both nodes have the proper packages installed and that the VMs' configuration and definitions exist on both nodes.
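Regarding the ZFS send/recv approach mentioned above, a minimal sketch of periodic incremental replication (the dataset name, peer hostname, and snapshot naming are assumptions, and error handling is omitted) could look like:

```
#!/bin/bash
# Incrementally replicate a dataset to the standby node.
# Assumes a snapshot named "prev" already exists on both sides
# (created by an initial full "zfs send | zfs recv").
DATASET=data/vm1
TARGET=standby-node

zfs snapshot "${DATASET}@new"
zfs send -i "${DATASET}@prev" "${DATASET}@new" | ssh "${TARGET}" zfs recv -F "${DATASET}"

# Rotate snapshot names for the next incremental run
zfs destroy "${DATASET}@prev"
ssh "${TARGET}" zfs destroy "${DATASET}@prev"
zfs rename "${DATASET}@new" "${DATASET}@prev"
ssh "${TARGET}" zfs rename "${DATASET}@new" "${DATASET}@prev"
```

Run from cron at whatever interval matches your acceptable data-loss window.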
I can assure you that the Linux HA stack (Corosync and Pacemaker) is still actively developed and supported. Many guides are old because the software has been around for 10 years. When done properly, there are no major problems or issues. It is not abandoned, it is just no longer "new and exciting".
Active/passive clusters are still heavily used in many places and running in production. Please find below our production setup; it is working fine, and you can either let it run in manual mode (orchestrate=start) or enable automatic failover (orchestrate=ha). We use ZFS to benefit from zfs send/receive and zfs snapshots, but it is also possible to use DRBD if you prefer synchronous replication.

Prerequisites:
Steps:
A few explanations:

- {svcname} in the service config file is a reference pointing to the actual service name (win1)
- the zfs dataset data/win1 is mounted on mountpoint /srv/win1
- the kvm guest is named win1
- sync#1 is used to declare an asynchronous zfs dataset replication to the slave node (data/win1 on node1 is sent to data/win1 on node2), once per 12 hours in the example (zfs send/receive is managed by the opensvc agent)

Some management commands:
- svcmgr -s win1 start : start the service
- svcmgr -s win1 stop : stop the service
- svcmgr -s win1 stop --rid container#0 : stop the container referenced as container#0 in the config file
- svcmgr -s win1 switch : relocate the service to the other node (see the combined example below)
- svcmgr -s win1 sync update : trigger an incremental zfs dataset copy
- svcmgr -s win1 sync full : trigger a full zfs dataset copy

Some services I manage also need zfs snapshots on a regular basis (daily/weekly/monthly), with retention. In that case I add the following config snippet to the service configuration file, and the opensvc agent does the job.
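The opensvc snapshot snippet itself is not shown above. Separately, just combining the management commands already listed, a planned manual relocation of the win1 service could look like this (exact behavior may vary with the agent version):

```
# Push a fresh incremental copy of the zfs dataset to the peer node,
# then relocate the service (stop locally, start on the other node).
svcmgr -s win1 sync update
svcmgr -s win1 switch
```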
As requested, I am also adding an lvm/drbd/kvm config:
DRBD resource config /etc/drbd.d/kvmdrbd.res:
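The actual resource file is not reproduced in the answer. Purely as an illustrative sketch (the DRBD device, backing logical volume, hostnames, and addresses below are all assumptions), such a two-node resource typically looks like:

```
# Illustrative only: the same file goes on both nodes
cat > /etc/drbd.d/kvmdrbd.res <<'EOF'
resource kvmdrbd {
    device    /dev/drbd0;
    disk      /dev/vgdrbd/lvdrbd;
    meta-disk internal;

    on node1 {
        address 192.168.10.1:7789;
    }
    on node2 {
        address 192.168.10.2:7789;
    }
}
EOF
```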
OpenSVC service config file /etc/opensvc/kvmdrbd.conf:

Some explanations:
- disk#1 : the LVM VG hosting the big logical volume; it should be at least 5 GB.
- disk#2 : the DRBD disk pointed to by the DRBD resource name. If the opensvc service is named "foo", you should have /etc/drbd.d/foo.res, or change the disk#2.res parameter in the service config file.
- fs#0 : the main filesystem hosting all disk files for the kvm guest.
- container#0 : the kvm guest, with the same name as the opensvc service in the example. The agent must be able to DNS-resolve the kvm guest name in order to do a ping check before accepting to start the service (if the ping answers, the kvm guest is already running somewhere and it is not a good idea to start it; double-start protection is ensured by the opensvc agent).
- standby = true : means that this resource must remain up when the service is running on the other node. In our example, it is needed to keep DRBD running fine.
- shared = true : see https://docs.opensvc.com/latest/agent.service.provisioning.html#shared-resources

I'm currently running an extremely similar system: 2 servers, one active, one backup, and both have a few VMs running inside them. The database is being replicated, and the file servers are in constant sync with rsync (but only one way). In case of emergency, traffic is served from the secondary server. There was the idea of using Pacemaker and Corosync, but since this has to be 100% reliable, I didn't have the courage to experiment. My idea is to have NginX watching over the servers. This works because I'm running a web application, but in your case I don't know if you could use it. DRBD is a mess for me: the previous servers were using it, and while it seemingly worked, it felt like I was trying to dissect a human body.
Check this out, it might help you: http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs
It doesn't look hard; in fact, I've already tried it in a small environment and it worked. Easy to learn, easy to set up, easy to maintain. Actually, I think this is what you are looking for.
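For what it is worth, the one-way file sync mentioned above is typically just a cron-driven rsync; a minimal sketch (paths and hostname are assumptions) would be:

```
# One-way mirror of the file share to the standby node (illustrative paths/host)
rsync -a --delete /srv/files/ standby-node:/srv/files/
```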