I've installed Red Hat's cluster software on CentOS 6.5 and use it to provide redundant routing from one network to another. This works fine: I have a pair of boxes providing the service, so that if one fails (for example, if I test by removing its network connections), the other takes over routing.
However, if I then have to do anything to the remaining box, I can't restart it due to problems with rgmanager: service rgmanager stop hangs, and the only way to stop the process is to kill -9 it. This obviously also affects any action that tries to stop the service, like a reboot or poweroff.
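Roughly, what I end up doing on the surviving node is the following (the pgrep step is just how I find the stuck process):

service rgmanager stop    # hangs and never returns
pgrep rgmanager           # find the PID of the stuck daemon
kill -9 <pid>             # the only thing that actually gets rid of it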
When I do manage to start the server on its own, although the cluster starts, rgmanager is not shown as running in clustat, and none of the redundant routing services are even visible, let alone started.
This could cause problems if, for instance, the boxes are deployed to a remote location, and need to be powered down before we've had a chance to replace the failed box.
Here's my cluster.conf:
<?xml version="1.0"?>
<cluster config_version="2" name="router-ha">
    <fence_daemon/>
    <clusternodes>
        <clusternode name="router-01" nodeid="1"/>
        <clusternode name="router-02" nodeid="2"/>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices/>
    <rm>
        <failoverdomains/>
        <resources>
            <ip address="10.0.0.1" monitor_link="1" sleeptime="0"/>
            <ip address="10.0.0.2" monitor_link="1" sleeptime="0"/>
            <ip address="10.2.0.1" monitor_link="1" sleeptime="0"/>
            <ip address="10.4.0.1" monitor_link="1" sleeptime="0"/>
        </resources>
        <service autostart="1" name="routing-a" recovery="restart">
            <ip ref="10.0.0.1"/>
            <ip ref="10.2.0.1"/>
        </service>
        <service autostart="1" name="routing-b" recovery="restart">
            <ip ref="10.0.0.2"/>
            <ip ref="10.4.0.1"/>
        </service>
    </rm>
</cluster>
Why can't I start the service on a single box if it can't see the other? Surely the whole point of a redundant pair is that you don't depend on the other machine in order to start a cluster service?
To run clustered services, a quorum is needed. In a typical three-node cluster, for example, every member has one vote: if you pull the plug on one node, it knows it is inquorate because it holds fewer than half of the available votes (the exact threshold is configurable). A cluster without quorum is not fit to run clustered services.
This is not specific to Red Hat clusters but a general principle, although the solutions and implementations vary. Two-node clusters are a special case: if you give both nodes one vote each, neither of them would normally be quorate after losing contact with the other.
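As a quick way to see how votes are being counted on a CentOS 6 / cman node, something along these lines should show the expected votes and whether the node considers itself quorate (a general sketch, not specific to the configuration above):

cman_tool status    # vote counts and quorum state as cman sees them
clustat             # member status and, if rgmanager is running, the service states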
In the case of Red Hat, a special condition applies to two-node clusters:
When you pull the plug, both nodes lose contact with each other.
To determine which one of them has quorum, they will both try to STONITH each other.
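For that fencing step to be able to do anything, a fence device has to be defined in cluster.conf and attached to each node. Purely as a sketch, assuming IPMI-capable hardware (the agent choice, addresses and credentials below are placeholders, not taken from your setup), the relevant parts would look roughly like this:

<fencedevices>
    <!-- placeholder BMC addresses and credentials; substitute your own fence agent/device -->
    <fencedevice agent="fence_ipmilan" name="ipmi-r01" ipaddr="192.168.100.1" login="admin" passwd="changeme"/>
    <fencedevice agent="fence_ipmilan" name="ipmi-r02" ipaddr="192.168.100.2" login="admin" passwd="changeme"/>
</fencedevices>
<clusternodes>
    <clusternode name="router-01" nodeid="1">
        <fence>
            <method name="ipmi">
                <device name="ipmi-r01"/>
            </method>
        </fence>
    </clusternode>
    <clusternode name="router-02" nodeid="2">
        <fence>
            <method name="ipmi">
                <device name="ipmi-r02"/>
            </method>
        </fence>
    </clusternode>
</clusternodes>

With a working fence device, fenced can actually power-cycle the node it has lost contact with instead of waiting forever.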
So your assumption is correct: your cluster is misconfigured and needs a fencing agent to be operational. By pulling the plug you don't just make the service unavailable, which would normally cause rgmanager to fail over to the other node (or do whatever else you've configured it to do); you also remove the heartbeat link between the clustered nodes. Even though rgmanager may try to do what you've configured it to do, cman still cannot figure out which of the nodes has quorum. Instead it will keep trying to fence the other node, but since you have no fence agent configured, it will be stuck indefinitely.
So here are two pieces of advice for you: