I am having the following problem with an OpenSuSE + Heartbeat + Pacemaker + Xen HA cluster: when the node a Xen domU is running on is "dead", the domU is not restarted on the second node.
The cluster is set up with two nodes, each running OpenSuSE 11.3, Heartbeat 3.0, and Pacemaker 1.0 in CRM mode. For storage I am using a LUN on an iSCSI SAN device; the LUN is formatted with OCFS2 and managed with LVM. The Xen domU has two logical volumes: one for root and one for swap. I am using IPMI cards as STONITH devices, and a dedicated Ethernet link for heartbeat communications.
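Roughly, the storage side was set up along these lines (the LV names match the domU config below; the LUN device name and sizes here are just placeholders):

pvcreate /dev/sdb                        # the iSCSI LUN
vgcreate vg_xen /dev/sdb
lvcreate -n xen-util-root -L 20G vg_xen  # root LV, size placeholder
lvcreate -n xen-util-swap -L 4G vg_xen   # swap LV, size placeholder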
The ha.cf file is as follows:
keepalive 1
deadtime 10
warntime 5
udpport 694
ucast eth1
auto_failback off
node dhcp-166
node stage
use_logd yes
crm yes
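The peer IP is omitted from the ucast line above; the full directive takes the peer's address on the same line, e.g. with a placeholder IP:

ucast eth1 192.168.1.2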
My resources look as follows:
crm(live)configure# show
node $id="5c1aa924-bba4-4f95-a367-6c9a58ac4a38" dhcp-166
node $id="cebc92eb-af24-4833-aaf0-672adf80b58e" stage
primitive Xen-Util ocf:heartbeat:Xen \
meta target-role="Started" \
operations $id="Xen-Util-operations" \
op start interval="0" timeout="60" start-delay="0" \
op stop interval="0" timeout="120" \
params xmfile="/etc/xen/vm/xen-util"
primitive my-stonith stonith:external/ipmi \
params hostname="dhcp-166" ipaddr="192.168.3.106" userid="ADMIN" passwd="xxx" \
op monitor interval="2m" timeout="60s"
primitive my-stonith2 stonith:external/ipmi \
params hostname="stage" ipaddr="192.168.3.105" userid="ADMIN" passwd="xxx" \
op monitor interval="2m" timeout="60s"
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat"
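For anyone reproducing this, node and resource state during failover can be watched with crm_mon, which ships with Pacemaker:

crm_mon -1    # one-shot snapshot of node and resource status
crm_mon       # or keep it running for a continuously refreshing view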
The Xen domU config file is as follows:
name = "xen-util"
bootloader = "/usr/lib/xen/boot/domUloader.py"
#bootargs = "xvda1:/vmlinuz-xen,/initrd-xen"
bootargs = "--entry=xvda1:/boot/vmlinuz-xen,/boot/initrd-xen"
memory = 4096
disk = [ 'phy:vg_xen/xen-util-root,xvda1,w',
'phy:vg_xen/xen-util-swap,xvda2,w', ]
root = "/dev/xvda1"
vif = [ 'mac=00:16:3e:42:42:06' ]
#vfb = [ 'type=vnc,vncunused=0,vnclisten=192.168.3.172' ]
extra = ""
Say domU "Xen-Util" is running on node "stage"; if "stage" goes down, "Xen-Util" does not restart on node "dhcp-166". It does seem to try: "xm list" shows it for a few seconds, and "xm console xen-util" gives a message like "copying /boot/kernel.gz from xvda1 to /var/lib/xen/tmp/kernel.a53gs for booting". However, it never gets past that point, eventually gives up, and no longer appears in "xm list". When node "stage" comes back online after being power-cycled, it detects that "Xen-Util" isn't running and starts it (on "stage").
I've tried starting "Xen-Util" on node "dhcp-166" without the cluster running, and it works fine, no problems; so I know the domU itself can run on that node.
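The manual test was something like this, using the config file path from the primitive above:

xm create -c /etc/xen/vm/xen-util   # start the domU and attach to its console
xm shutdown xen-util                # stop it again before handing control back to the cluster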
Any ideas? Thanks!
I figured it out after a bit more trial and error: iSCSI errors were propagating back up the stack too quickly, as mentioned in this post on ServerFault.
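The exact variables are outlined in the linked post; for context, open-iscsi's timeout tuning lives in /etc/iscsi/iscsid.conf, in settings of this family (values illustrative, and my assumption about which ones are relevant):

node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5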
In addition to changing the variables outlined in the post above, I also traced some network cables and discovered that node #2 was on a 100Mb link, while node #1 and the SAN were on gigabit links. After some careful shuffling, all the network connections are now running at gigabit speeds.
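Link speed is easy to verify with ethtool, e.g.:

ethtool eth0 | grep -i speed    # should report Speed: 1000Mb/s on a gigabit link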
Finally, I raised the MTU on the Linux interfaces from 1500 to 9000, which also seems to have sped things up a bit.
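On OpenSuSE the change can be made on the fly and then persisted in the interface config (the interface name is whichever faces the SAN, and the switch and SAN ports must support jumbo frames too):

ip link set dev eth0 mtu 9000   # immediate, non-persistent
# persistent: add this line to /etc/sysconfig/network/ifcfg-eth0
MTU='9000'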
The final result is a working cluster with the domU booting even quicker than before on node #1.
Cheers,
Kendall