I'm having a few issues trying to understand ha.cf and how the cluster picks up on updates.
For example, when creating a new cluster, I usually:
- Set some default options in ha.cf on node 1 - node x
- Start the cluster.
- Run crm on any node, configure resources.
Whilst I usually do nodes up/down, resources up/down, I have never actually added a new node at a later date.
Just for "fun", I decided to run a new server that only specified one node in the cluster in its ha.cf, and then start heartbeat.
This machine successfully joined the cluster and added itself to every other node in the cluster. Where I get confused is that even if I shut down all nodes and reboot the original 2 nodes, they both still list the third server as in the cluster but offline, despite the third not being in the original 2 nodes' ha.cf files.
Even if I edit ha.cf and change some nonsense value, or touch the file, then reboot the server and cluster, it is still there. So my conclusion is that the CIB takes precedence over ha.cf, but what I don't get is why/how.
I'm really looking for best practices - should any machine just have enough in ha.cf to "get it up", then do everything in CRM? Is ha.cf a waste of time, or should I be using it a lot more?
Trying not to be so vague - I'm really just looking for what I should be doing in CRM, and what I should be doing in ha.cf?
Thanks,
Wil
I was really hoping to see a good answer myself.
All I can really do is endorse your experiences: that the only real function of heartbeat in these circumstances is to start pacemakerd, the CRM subsystem. This (as you know) maintains its own database of nodes and state, which on my systems is /var/lib/heartbeat/crm/cib.xml. The files in /etc/ha.d inform heartbeat, but not crm.
I am running a number of failover pairs doing various things, most of which have been up for over 500 days and some of which are close to 1000 days, and most of which have survived any number of failovers and failbacks; so I can only assume I'm doing something right. My practice is not to actually lie in ha.cf, but to put almost nothing in there other than what is required to get HA to start up CRM.
I'm sorry I don't have anything more concrete to point you at.
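For what it's worth, a "just enough to start CRM" ha.cf might look roughly like the sketch below. This is only an illustration, not my actual config: the node names (alice, bob), the interface (eth0), the timing values and the scratch output path are all placeholders I made up. The directives themselves (autojoin, bcast, keepalive, warntime, deadtime, initdead, node, crm) are standard ha.cf ones. It writes to a scratch file rather than /etc/ha.d so it is safe to try, and a real setup also needs a matching authkeys file.

```shell
#!/bin/sh
# Sketch only: writes an illustrative ha.cf to a scratch path, NOT to /etc/ha.d.
# Node names (alice, bob) and the interface (eth0) are placeholders.
HA_CF_SKETCH="${HA_CF_SKETCH:-./ha.cf.sketch}"

cat > "$HA_CF_SKETCH" <<'EOF'
# Do not let unlisted nodes join automatically
# (the other accepted values are "other" and "any").
autojoin none
# Messaging-layer settings: transport and timings in seconds.
bcast eth0
keepalive 2
warntime 5
deadtime 15
initdead 60
# Cluster members, named as reported by `uname -n`.
node alice
node bob
# Hand resource management over to the CRM (Pacemaker).
crm respawn
EOF

echo "wrote $HA_CF_SKETCH"
```

If unlisted machines are able to add themselves, the autojoin setting is probably the first thing I would look at, though I have not tested that against the exact setup in the question.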
Apparently, you run Pacemaker, a Cluster Resource Manager, on top of Heartbeat v3, a cluster messaging layer. You may find more info here. For instance, older versions of Heartbeat required users to add ping node configuration to ha.cf; this is no longer required with the pingd resource agent in Pacemaker.
The role of a resource agent is to abstract the service it provides and present a consistent view to the cluster, which allows the cluster to be agnostic about the resources it manages. The cluster doesn't need to understand how the resource works because it relies on the resource agent to do the right thing when given a start, stop or monitor command.
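To make the start/stop/monitor contract concrete, here is a toy agent sketch. It is not a real OCF agent: the "service" is just a state file, and the names toy_agent and STATE_FILE are made up for illustration. The exit codes (0 for success, 3 for unimplemented, 7 for not running) follow the OCF convention. A real agent would also have to implement actions such as meta-data and validate-all.

```shell
#!/bin/sh
# Toy resource agent sketch: the managed "service" is only a state file.
# toy_agent and STATE_FILE are hypothetical names used for illustration.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/toy-agent.$$.state}"

toy_agent() {
    case "$1" in
        start)
            # "Start" the service by creating the state file.
            touch "$STATE_FILE" || return "$OCF_ERR_GENERIC"
            return "$OCF_SUCCESS"
            ;;
        stop)
            # Stopping an already stopped resource still counts as success.
            rm -f "$STATE_FILE"
            return "$OCF_SUCCESS"
            ;;
        monitor)
            # Report whether the resource is currently running.
            if [ -f "$STATE_FILE" ]; then
                return "$OCF_SUCCESS"
            fi
            return "$OCF_NOT_RUNNING"
            ;;
        *)
            return "$OCF_ERR_UNIMPLEMENTED"
            ;;
    esac
}
```

The cluster only ever sees the action name going in and the exit code coming out, which is why it does not need to know anything about the service itself.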
So you should distinguish the configurations and check the following in your
Let me also suggest the following tests:
Did you make the right Heartbeat process reread its configuration?
kill -HUP $GoodHeartbeatPID
CRM changes need a commit (cib.xml, a.k.a. the Cluster Information Base, is generated by this command):
crm_verify -L -V
cib commit $yourconf
Also check name resolution: /etc/hosts, DNS, etc.
Be careful with the restart order:
on your still-active node. This will shut down your cluster resources.
on your standby node (the one where you created your CIB). This will start the local Heartbeat instance and Pacemaker, and wait for other cluster nodes to check in.
on your other node. This will start the local Heartbeat instance and Pacemaker, fetch the CIB automatically, and start applications.
Kind regards