I am trying to understand the Heartbeat
setup in a new environment. It is a 2-node cluster still running version 1 of Heartbeat (the release that does not use the Pacemaker CRM), and I have a fundamental question that I could not find an easy-to-understand answer to on Google.
The question is: in the case of a communication failure between the nodes in the cluster, with both nodes still functioning well, how does the cluster manager identify which node is to be shot? I see a ping_group
directive in /etc/ha.d/ha.cf
. From what I have read, the cluster manager checks connectivity from each cluster node to the hosts listed in ping_group
and uses that to decide which node should be shot(?). But what if the connections from both nodes to the ping hosts are alive, and only the heartbeat network between the two cluster nodes is down? What am I missing here?
Situation: only the heartbeat network is down, but both nodes are up and fine.
root@automan00:/root : cat /etc/ha.d/ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 500ms
deadtime 30
warntime 10
initdead 120
udpport 694
baud 19200
bcast bond1 eth2
auto_failback off
node automan00
node automan01
ping_group group1 1.1.1.1 2.2.2.2
respawn hacluster /usr/lib64/heartbeat/ipfail
realtime on
# stonith directive
stonith external/riloe /etc/ha.d/riloe.cfg
One option is to add a crossover cable between the nodes, with private IPs, as an additional heartbeat network.
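As a sketch, the extra link would be declared alongside the existing ones in /etc/ha.d/ha.cf; the interface name and serial device here are assumptions for illustration, not taken from your setup:

```
# Hypothetical additional heartbeat paths in /etc/ha.d/ha.cf.
# eth3 is assumed to be the crossover-cable interface:
bcast eth3              # extra broadcast heartbeat over the crossover link
serial /dev/ttyS0       # optional serial link as a further fallback
```

More independent heartbeat paths reduce the chance that all of them fail at once, but with only two nodes the ambiguity described below still remains.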
However: when communication fails between only two nodes, neither side can know which node should be shot. This is why you need a third node before going to production.
Without a third node to arbitrate which node is working properly and which is not, you will find yourself in a split-brain situation:
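A minimal sketch of why this is so, assuming a toy model where each node only observes which peers it can reach and fences the other side only when it holds a strict majority of the cluster (the function name and model are mine, not Heartbeat's actual logic):

```python
# Toy quorum model: a node may fence its peers only when its own
# partition holds a strict majority of the cluster's votes.
def should_fence_peer(reachable_peers, total_nodes):
    """Return True if this node's partition is a strict majority."""
    partition_size = 1 + len(reachable_peers)  # this node plus reachable peers
    return partition_size > total_nodes / 2

# 2-node cluster, heartbeat link down: each node sees a partition of
# size 1 out of 2 -- no majority, so neither can safely shoot the other.
print(should_fence_peer([], total_nodes=2))         # False

# 3-node cluster: the side that still sees one peer holds 2 of 3 votes
# and may fence; the isolated node holds 1 of 3 and must not.
print(should_fence_peer(["node3"], total_nodes=3))  # True
print(should_fence_peer([], total_nodes=3))         # False
```

With two nodes, "I cannot reach my peer" looks identical whether the peer is dead or only the link is down, so no local rule can break the tie; a third vote makes one partition a majority.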
https://en.wikipedia.org/wiki/Split-brain_(computing)
It is not good practice to have a "kill myself" tool (a dead-man switch or similar), because a node can never know what happened to the other one. Whether only the communication failed or the other host actually went down, it sees exactly the same behaviour, so it cannot safely kill itself in either case. The same applies from the other node's point of view.
I know this is not a solution, but I hope it helps you understand how the cluster manager works. If you build a cluster, use more than two nodes; it is that simple.