Using indirect locks with virtlockd (which is used by libvirtd) requires a cluster-wide shared filesystem like OCFS2. In turn this means that virtlockd must be started after the shared filesystem is mounted (otherwise the locks created would be local at best). Naturally libvirtd must be started after virtlockd, and any VM after virtlockd.
So the start order I want is: pacemaker, DLM, OCFS2 mount, virtlockd, libvirtd, VMs... and for stop I want the opposite order.
I have configured all those primitives (specifically systemd:libvirtd.service and systemd:virtlockd), clones and constraints correctly (I hope), but I'm still having issues with virtlockd.
On a system like SLES 15, systemd controls those services, and it seems systemd has a life of its own, starting services even though they are all disabled.
So the question: did anybody manage to succeed with such a setup?
Update (2021-02-04)
I found this "Drop-In" in the status output for virtlockd.service:
/run/systemd/system/virtlockd.service.d/50-pacemaker.conf
It contains:
[Unit]
Description=Cluster Controlled virtlockd
Before=pacemaker.service pacemaker_remote.service
[Service]
Restart=no
A corresponding file /run/systemd/system/libvirtd.service.service.d/50-pacemaker.conf exists:
[Unit]
Description=Cluster Controlled libvirtd.service
Before=pacemaker.service pacemaker_remote.service
[Service]
Restart=no
Could these cause the problems I'm seeing (systemd starting libvirtd-ro.socket, libvirtd-admin.socket and libvirtd.service, then starting virtlockd)?
Update (2021-02-05)
It seems the resources are started in the correct order when the node boots (e.g. after being fenced), but when pacemaker is restarted (e.g. via crm cluster restart), systemd interferes and starts virtlockd before pacemaker wants to start it. Maybe the difference is the /run directory.
Update (2021-02-08)
Another issue I found is that even though /etc/libvirt/libvirtd.conf contains listen_tls = 1, starting libvirtd through pacemaker as described above results in a libvirtd that has not opened the TLS socket, which prevents VM live migration.
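One way to see whether the TLS listener is actually present (just an illustration; 16514 is libvirt's default TLS port):
# ss -tlnp | grep 16514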
There still is some locking issue during live-migration that might be a bug in libvirtd, but I think I got the solution:
Parts of this solution are found in https://bugzilla.redhat.com/show_bug.cgi?id=1750340.
The first thing is not to use systemd's "socket activation" for libvirtd. It's not enough to disable all the socket units (libvirtd.socket, libvirtd-ro.socket, libvirtd-admin.socket, libvirtd-tcp.socket, libvirtd-tls.socket); you'll have to mask them. With the sockets masked, however, libvirtd does not open the TLS socket on its own, even if listen_tls = 1 is set. To activate it you must (in SLES 15 SP2) edit /etc/sysconfig/libvirtd to activate LIBVIRTD_ARGS="--listen". At the same time you must deactivate LIBVIRTD_ARGS="--timeout 120" to prevent automatic termination of libvirtd.
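In concrete terms this amounts to something like the following (a sketch; only the relevant lines of /etc/sysconfig/libvirtd are shown):
# systemctl mask libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket libvirtd-tcp.socket libvirtd-tls.socket
/etc/sysconfig/libvirtd:
LIBVIRTD_ARGS="--listen"
(and leave the LIBVIRTD_ARGS="--timeout 120" line commented out)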
Finally you'll have to start (once configured) and stop virtlockd and libvirtd at the right point in time. I am using a multipath SAN device to host my VM images, while a clustered RAID1 (/dev/md10) holds the locks. OCFS2 is used as the filesystem on top of both. Clustered MD and OCFS2 both need the DLM. Cluster-wide resources use clones to distribute them. I run three test VMs once libvirt and the image path /cfs/VMI are ready. I won't explain the steps for configuring lockd with indirect locking here, just as I won't explain how to set up a clustered MD RAID or an OCFS2 filesystem.
Here is the three-node cluster configuration in crm syntax (the fencing resource is omitted):
So the essential ordering is:
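In crm syntax the chain can be expressed with order constraints roughly like the following (the clone and resource names here are illustrative assumptions, not necessarily the ones used in the actual configuration; vm1, vm2, vm3 stand for the three test VMs):
order o-base-before-virtlockd Mandatory: cl-base cl-virtlockd
order o-virtlockd-before-libvirtd Mandatory: cl-virtlockd cl-libvirtd
order o-libvirtd-before-vm Mandatory: cl-libvirtd ( vm1 vm2 vm3 )
Here cl-base stands for the clone holding DLM, the clustered MD RAID and the OCFS2 mounts, and cl-virtlockd/cl-libvirtd for the clones of the two systemd services. Since order constraints are symmetrical by default, stopping happens in the reverse direction.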
The remaining issue is that libvirtd claims the VM is not locked on the original node during live migration, shortly before the migration succeeds.
Windl,
In order to close the gap between the virtualization and HA product documentation, I set up a highly available virtualization environment; the detailed steps are below.
SLE HA hardware environment details:
Installation and setup process:
1. Attach an iSCSI disk to both cluster nodes and divide it into 3 partitions (e.g. 50MB for SBD, 20GB for OCFS2, the remainder for VM images).
Add a network bridge br0 on each cluster node (it will be used when installing/running virtual machines); a configuration sketch follows below.
Set up password-free SSH login for the root user between the cluster nodes.
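A minimal sketch of such a bridge configuration in /etc/sysconfig/network/ifcfg-br0 (wicked syntax; the port name eth0 and the address are assumptions):
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='192.168.1.11/24'
BRIDGE='yes'
BRIDGE_PORTS='eth0'
BRIDGE_STP='off'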
2. Install HA and virtualization related packages on each cluster node.
# zypper in -t pattern ha_sles
# zypper in -t pattern kvm_server kvm_tools
3. Set up the HA cluster and add the SBD device.
Refer to the HA guide at http://docserv.suse.de/documents/en-us/sle-ha/15-SP2/single-html/SLE-HA-guide/#book-sleha-guide
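For example (a sketch; the SBD partition path is a placeholder and the node name is taken from the configuration below), bootstrap on the first node and join from the second one:
# crm cluster init -s /dev/disk/by-id/<sbd-partition>
# crm cluster join -c sle15sp2-test1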
4. Set up DLM and OCFS2 resources in crm.
e.g.
primitive dlm ocf:pacemaker:controld \
op monitor interval=60 timeout=60
primitive ocfs2-2 Filesystem \
params device="/dev/disk/by-id/scsi-149455400000000004100c3befec3dc9a81f9ce28f7a8b8de-part1" directory="/mnt/shared" fstype=ocfs2 \
op monitor interval=20 timeout=40
group base-group dlm ocfs2-2
clone base-clone base-group \
meta interleave=true
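Before the Filesystem resource can mount it, the partition needs an OCFS2 filesystem; a sketch (the slot count and options are assumptions for a two-node pacemaker cluster):
# mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 2 /dev/disk/by-id/scsi-149455400000000004100c3befec3dc9a81f9ce28f7a8b8de-part1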
5. Set up the virtlockd and libvirtd services on each cluster node.
Edit /etc/libvirt/qemu.conf and set lock_manager = "lockd".
Edit /etc/libvirt/qemu-lockd.conf and set file_lockspace_dir = "/mnt/shared/lockd" (note: /mnt/shared is the OCFS2 filesystem mount point).
Restart/enable the libvirtd service (note: the virtlockd service will be started by libvirtd according to this configuration).
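Put together (a sketch; only the changed lines are shown, and creating the lockspace directory up front is an assumption of this example):
/etc/libvirt/qemu.conf:
lock_manager = "lockd"
/etc/libvirt/qemu-lockd.conf:
file_lockspace_dir = "/mnt/shared/lockd"
# mkdir -p /mnt/shared/lockd
# systemctl enable --now libvirtd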
6. Install a virtual machine (e.g. sle15-nd) on the shared partition from one cluster node, and dump the domain configuration to an XML file.
Move the virtual machine configuration file to the OCFS2 filesystem (e.g. /mnt/shared).
Note: please make sure the XML configuration file does not reference any unshared local paths.
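For example (a sketch; undefining the local domain afterwards so that only the shared XML is used is a common practice, not part of the steps above):
# virsh dumpxml sle15-nd > /mnt/shared/sle15-nd.xml
# virsh undefine sle15-nd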
7. Set up the VirtualDomain resource and ordering in crm.
e.g.
primitive vm_nd1 VirtualDomain \
params config="/mnt/shared/sle15-nd.xml" remoteuri="qemu+ssh://%n/system" \
meta allow-migrate=true \
op monitor timeout=30s interval=10s \
utilization cpu=2 hv_memory=1024
order ord_fs_virt Mandatory: base-clone vm_nd1
8. Check all your changes with the show command in crm, then commit:
e.g.
crm(live/sle15sp2-test1)configure# show
node 172167755: sle15sp2-test2
node 172168091: sle15sp2-test1
primitive dlm ocf:pacemaker:controld \
op monitor interval=60 timeout=60
primitive ocfs2-2 Filesystem \
params device="/dev/disk/by-id/scsi-149455400000000004100c3befec3dc9a81f9ce28f7a8b8de-part1" directory="/mnt/shared" fstype=ocfs2 \
op monitor interval=20 timeout=40
primitive stonith-sbd stonith:external/sbd \
params pcmk_delay_max=30s
primitive vm_nd1 VirtualDomain \
params config="/mnt/shared/sle15-nd.xml" remoteuri="qemu+ssh://%n/system" \
meta allow-migrate=true \
op monitor timeout=30s interval=10s \
utilization cpu=2 hv_memory=1024
group base-group dlm ocfs2-2
clone base-clone base-group \
meta interleave=true
order ord_fs_virt Mandatory: base-clone vm_nd1
property cib-bootstrap-options: \
have-watchdog=true \
dc-version="2.0.3+20200511.2b248d828-1.10-2.0.3+20200511.2b248d828" \
cluster-infrastructure=corosync \
cluster-name=hacluster \
stonith-enabled=true
rsc_defaults rsc-options: \
resource-stickiness=1 \
migration-threshold=3
op_defaults op-options: \
timeout=600 \
record-pending=true
Verify that the VM resource works on the HA cluster:
1. Verify the VM resource is protected across cluster nodes.
Test result: the VM cannot be started manually via virsh while it is running on another cluster node.
2. Verify the VM resource is taken over by another cluster node when the current cluster node crashes.
Test result: after a few seconds (the cluster fence time), the VM is started on another cluster node.
3. Verify the VM resource is taken over by another cluster node when the current cluster node reboots.
Test result: the VM is migrated to another cluster node.
4. Check whether the VM resource can be migrated between cluster nodes (see the example below).
Test result: yes, the remote SSH connection to the VM is not broken during the whole migration.
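For test 4, a migration can be triggered and the temporary location constraint removed afterwards like this (a sketch using the node name from the configuration above; on older crmsh versions the second command is unmigrate/unmove instead of clear):
# crm resource move vm_nd1 sle15sp2-test2
# crm resource clear vm_nd1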
Remarks:
1. In an actual production environment, cluster communication/management should use a separate network.
2. In an actual production environment, the cluster SBD device should use a separate shared disk to avoid I/O starvation.
3. Do not start a VM instance manually until the OCFS2 filesystem is mounted, since the file lockspace directory lives on the OCFS2 filesystem. In other words, you should let the cluster (pacemaker) manage starting and stopping all virtual machines.