I'm building a 2-node cluster on the CMAN + Pacemaker cluster stack, but I have no hardware STONITH devices. However, both nodes are connected to shared storage via iSCSI, and I would like to use it for fencing with SBD.
SBD is supported by OpenAIS and Heartbeat out of the box, and I have already used SBD with the Heartbeat + Pacemaker cluster stack, but now I need to get it working with CMAN. I have therefore added a little code to CMAN's init script to start and stop SBD, and it seems to work okay.
Regarding the watchdog: it is highly recommended to run SBD with a watchdog. SBD is a critical service in such a cluster and must be running whenever the cluster software is up. The watchdog helps to ensure that: if for some reason the SBD process terminates unexpectedly, the dog is no longer fed and it reboots the node. So I load the watchdog kernel module softdog right before starting SBD (as Heartbeat does, for example).
Briefly, I do the following in /etc/init.d/cman:
- load the kernel module -
modprobe softdog
- start SBD -
sbd -d <device> -D -W watch
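For illustration, the two steps above could be wrapped in a small shell function like the sketch below. This is not the actual code from my init script: the names SBD_DEVICE, DRY_RUN, run and start_sbd are made up for this example, and the device path is simply the one that appears in the logs. By default the sketch only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the SBD start steps; not the real /etc/init.d/cman code.
# SBD_DEVICE, DRY_RUN, run and start_sbd are hypothetical names.

SBD_DEVICE="${SBD_DEVICE:-/dev/iscsi/disk2/part1}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command in dry-run mode; execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_sbd() {
    # Load the software watchdog first so that /dev/watchdog exists.
    run modprobe softdog || return 1
    # Start SBD on the shared device with watchdog support (-W).
    run sbd -d "$SBD_DEVICE" -D -W watch
}
```

Calling start_sbd with DRY_RUN=1 prints the modprobe and sbd command lines in order; with DRY_RUN=0 (as root) it runs them.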
And here is the problem:
when I start cman by hand
service cman start
everything is okay, but when cman starts automatically during boot, the node is rebooted.
It seems that the watchdog is not being fed, because the node is fenced exactly <watchdog timeout> seconds (5 s) after SBD initializes the watchdog (14:21:29). However, the logs (/var/log/syslog) contradict this and suggest that SBD is running:
Jan 15 14:21:28 cs-node1 kernel: [ 12.341755] softdog: Software Watchdog Timer: 0.08 initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
...
Jan 15 14:21:29 cs-node1 sbd: [1200]: notice: Using watchdog device:/dev/watchdog
Jan 15 14:21:29 cs-node1 sbd: [1200]: info: Set watchdog timeout to 5 seconds.
...
Jan 15 14:21:30 cs-node1 sbd: [1202]: info: Latency: 1 on disk /dev/iscsi/disk2/part1
Jan 15 14:21:34 sbd: last message repeated 3 times
...
Jan 15 14:21:34 cs-node1 sbd: [1202]: info: Latency: 1 on disk /dev/iscsi/disk2/part1
Any ideas? Thanks!
p.s. Anyone with reputation over 300 points, please, consider creating the following tags: sbd or storage-based-death, stonith and cman.
Edit 1:
I have now created a separate init script to manage SBD, which starts just before CMAN, but nothing has changed: it only works when started manually after login. What is so special about the boot process that I am missing?
Edit 2:
I have recently noticed that the following messages are sometimes missing from the logs altogether:
Jan 15 14:21:30 cs-node1 sbd: [1202]: info: Latency: 1 on disk /dev/iscsi/disk2/part1
Jan 15 14:21:34 sbd: last message repeated 3 times
...
Jan 15 14:21:34 cs-node1 sbd: [1202]: info: Latency: 1 on disk /dev/iscsi/disk2/part1
Also, the node is not always rebooted exactly 5 seconds after the watchdog is initialized; more often it happens after 12 seconds, but every time exactly as the login prompt appears. SBD is still running even when there are no messages in syslog (I have added a background process that starts together with SBD and monitors its process).
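The monitoring process mentioned above can be sketched roughly as follows. This is a simplified illustration, not the exact script I run: is_alive and monitor_pid are names invented here, the log file location is up to the caller, and how the SBD PID is obtained is left open. It relies on kill -0, which only tests whether a signal could be delivered, so it should run as root, like the init script.

```shell
#!/bin/sh
# Sketch of a process monitor; is_alive and monitor_pid are hypothetical names.

is_alive() {
    # kill -0 sends no signal; it only checks that the PID exists
    # and that we are allowed to signal it.
    kill -0 "$1" 2>/dev/null
}

monitor_pid() {
    # Usage: monitor_pid PID LOGFILE [MAX_CHECKS]
    # Appends one line per second while PID is alive.
    # MAX_CHECKS=0 (the default) means "run until the process is gone".
    pid="$1"
    logfile="$2"
    max="${3:-0}"
    n=0
    while is_alive "$pid"; do
        echo "$(date '+%b %d %H:%M:%S') pid $pid alive" >> "$logfile"
        n=$((n + 1))
        if [ "$max" -gt 0 ] && [ "$n" -ge "$max" ]; then
            return 0
        fi
        sleep 1
    done
    echo "$(date '+%b %d %H:%M:%S') pid $pid gone" >> "$logfile"
    return 1
}
```

Started in the background right after SBD with the SBD PID and a log file as arguments, it leaves a timestamped trail that can be compared with the syslog entries after the reboot.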