OK! Really new to pacemaker/corosync, like 1 day new.
Software: Ubuntu 18.04 LTS and the versions associated with that distro.
pacemakerd: 1.1.18
corosync: 2.4.3
I accidentally removed the nodes from my entire test cluster (3 nodes). When I tried to bring everything back up using the pcsd GUI, that failed because the nodes were "wiped out". Cool.
So. I had a copy of the last corosync.conf from my "primary" node. I copied it to the other two nodes, fixed the bindnetaddr on the respective confs, and ran pcs cluster start on my "primary" node.
One of the nodes failed to come up. I took a look at the status of pacemaker on that node and got the following error:
Dec 18 06:33:56 region-ctrl-2 crmd[1049]: crit: Nodes 1084777441 and 2 share the same name 'region-ctrl-2': shutting down
I tried running crm_node -R --force 1084777441 on the machine where pacemaker won't start, but of course pacemaker isn't running, so I got a crmd: connection refused (111) error. So I ran the same command on one of the healthy nodes, which showed no errors, but the node never went away and pacemaker on the affected machine continued to show the same error.
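For clarity, the removal attempts were roughly:
# on region-ctrl-2, where pacemaker won't start -- fails: crmd: connection refused (111)
crm_node -R 1084777441 --force
# on a healthy node -- exits cleanly, but the phantom node never goes away
crm_node -R 1084777441 --force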
So, I decided to tear down the entire cluster and start again. I purge-removed all the packages from the machine, reinstalled everything fresh, copied the fixed corosync.conf back to the machine, and recreated the cluster. I get the exact same bloody error.
This node named 1084777441 is not a machine I created; it's one the cluster created for me. Earlier in the day I realized I was using IP addresses in corosync.conf instead of names. I fixed the /etc/hosts of the machines and removed the IP addresses from the corosync config, and that's why I inadvertently deleted my whole cluster in the first place (I removed the nodes that were IP addresses).
The following is my corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
The only thing different about this conf between the nodes is the bindnetaddr.
There seems to be a chicken-and-egg issue here, unless there's some way I'm not aware of to remove a node from a flat-file or SQLite DB somewhere, or some other more authoritative way to remove a node from a cluster.
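For what it's worth, the closest thing to such a flat-file DB that I know of is Pacemaker's CIB and corosync's runtime state on disk (a sketch; paths are the stock Debian/Ubuntu locations):
# with the stack stopped, this is where node/cluster state lives locally
systemctl stop pacemaker corosync
ls /var/lib/pacemaker/cib/    # cib.xml and signature files -- Pacemaker's configuration/state
ls /var/lib/corosync/         # corosync runtime state (ring id files, etc.)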
ADDITIONAL
I've made sure that /etc/hosts and the hostname of each of the machines match. I forgot to mention that.
127.0.0.1 localhost
127.0.1.1 postgres
192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2
192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
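To double-check that, a few commands along these lines (a sketch) confirm that names and hostnames line up on each node:
uname -n                 # should print e.g. region-ctrl-2 on that machine
cat /etc/hostname        # should match uname -n
getent hosts postgres-sb region-ctrl-1 region-ctrl-2    # every name must resolve on every node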
I decided to try to start from scratch. I did an apt remove --purge of corosync*, pacemaker*, crmsh, and pcs, and rm -rf'd /etc/corosync. I kept a copy of the corosync.conf on each machine.
I re-installed everything on each of the machines, copied my saved corosync.conf back to /etc/corosync/, and restarted corosync on all the machines.
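Per node, that was something like this (a sketch; the saved-conf path is illustrative):
apt remove --purge corosync* pacemaker* crmsh pcs
rm -rf /etc/corosync
apt install corosync pacemaker crmsh pcs
cp /root/corosync.conf.saved /etc/corosync/corosync.conf
systemctl restart corosync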
I STILL get the same exact error. This has to be a bug in one of the components!
So it seems that crm_get_peer is failing to recognize that the host named region-ctrl-2 is assigned nodeid 2 in corosync.conf. That node then gets auto-assigned an ID of 1084777441. This is the part that doesn't make sense to me. The hostname of the machine is region-ctrl-2, set in /etc/hostname and /etc/hosts and confirmed using uname -n. The corosync.conf explicitly assigns an ID to the machine named region-ctrl-2, but something is apparently not recognizing the assignment from corosync and instead assigns the ID 1084777441 to this host. How the freak do I fix this?
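Incidentally, that ID doesn't look random at all: 192.168.99.225 packed as a 32-bit integer, with the high bit cleared (which is what clear_node_high_bit: yes asks for), is exactly 1084777441. So it looks like corosync isn't matching this node against its nodelist entry and is auto-generating a nodeid from the interface address instead. A quick check in bash:
printf '%d\n' 0xC0A863E1              # 192.168.99.225 as a 32-bit integer: 3232261089
echo $(( 0xC0A863E1 & 0x7FFFFFFF ))   # high bit cleared: 1084777441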
LOGS
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
info: get_cluster_type: Detected an active 'corosync' cluster
info: qb_ipcs_us_publish: server name: pacemakerd
info: pcmk__ipc_is_authentic_process_active: Could not connect to lrmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to cib_ro IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to crmd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to attrd IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to pengine IPC: Connection refused
info: pcmk__ipc_is_authentic_process_active: Could not connect to stonith-ng IPC: Connection refused
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry ea4ec23e-e676-4798-9b8b-00af39d3bb3d/0x5555f74984d0 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: cluster_connect_quorum: Quorum acquired
info: crm_get_peer: Created entry 882c0feb-d546-44b7-955f-4c8a844a0db1/0x5555f7499fd0 for node postgres-sb/3 (2 total)
info: crm_get_peer: Node 3 is now known as postgres-sb
info: crm_get_peer: Node 3 has uuid 3
info: crm_get_peer: Created entry 4e6a6b1e-d687-4527-bffc-5d701ff60a66/0x5555f749a6f0 for node region-ctrl-2/2 (3 total)
info: crm_get_peer: Node 2 is now known as region-ctrl-2
info: crm_get_peer: Node 2 has uuid 2
info: crm_get_peer: Created entry 5532a3cc-2577-4764-b9ee-770d437ccec0/0x5555f749a0a0 for node region-ctrl-1/1 (4 total)
info: crm_get_peer: Node 1 is now known as region-ctrl-1
info: crm_get_peer: Node 1 has uuid 1
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
warning: crm_find_peer: Node 1084777441 and 2 share the same name: 'region-ctrl-2'
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: pcmk_quorum_notification: Quorum retained | membership=32 members=3
notice: crm_update_peer_state_iter: Node region-ctrl-1 state is now member | nodeid=1 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node postgres-sb state is now member | nodeid=3 previous=unknown source=pcmk_quorum_notification
notice: crm_update_peer_state_iter: Node region-ctrl-2 state is now member | nodeid=1084777441 previous=unknown source=pcmk_quorum_notification
info: crm_reap_unseen_nodes: State of node region-ctrl-2[2] is still unknown
info: pcmk_cpg_membership: Node 1084777441 joined group pacemakerd (counter=0.0, pid=32765, unchecked for rivals)
info: pcmk_cpg_membership: Node 1 still member of group pacemakerd (peer=region-ctrl-1:900, counter=0.0, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node region-ctrl-1[1] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 3 still member of group pacemakerd (peer=postgres-sb:976, counter=0.1, at least once)
info: crm_update_peer_proc: pcmk_cpg_membership: Node postgres-sb[3] - corosync-cpg is now online
info: pcmk_cpg_membership: Node 1084777441 still member of group pacemakerd (peer=region-ctrl-2:3016, counter=0.2, at least once)
pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
lrmd: info: qb_ipcs_us_publish: server name: lrmd
pengine: info: qb_ipcs_us_publish: server name: pengine
cib: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: info: get_cluster_type: Verifying cluster type: 'corosync'
attrd: info: get_cluster_type: Assuming an active 'corosync' cluster
info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: get_cluster_type: Verifying cluster type: 'corosync'
cib: info: get_cluster_type: Assuming an active 'corosync' cluster
info: get_cluster_type: Verifying cluster type: 'corosync'
info: get_cluster_type: Assuming an active 'corosync' cluster
notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: info: validate_with_relaxng: Creating RNG parser context
crmd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
crmd: info: get_cluster_type: Verifying cluster type: 'corosync'
crmd: info: get_cluster_type: Assuming an active 'corosync' cluster
crmd: info: do_log: Input I_STARTUP received in state S_STARTING from crmd_init
attrd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
attrd: info: crm_get_peer: Created entry af5c62c9-21c5-4428-9504-ea72a92de7eb/0x560870420e90 for node (null)/1084777441 (1 total)
attrd: info: crm_get_peer: Node 1084777441 has uuid 1084777441
attrd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
attrd: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
info: crm_get_peer: Created entry 5bcb51ae-0015-4652-b036-b92cf4f1d990/0x55f583634700 for node (null)/1084777441 (1 total)
info: crm_get_peer: Node 1084777441 has uuid 1084777441
info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
attrd: info: corosync_node_name: Unable to get node name for nodeid 1084777441
attrd: notice: get_node_name: Defaulting to uname -n for the local corosync node name
attrd: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: init_cs_connection_once: Connection to 'corosync': established
info: corosync_node_name: Unable to get node name for nodeid 1084777441
notice: get_node_name: Defaulting to uname -n for the local corosync node name
info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Could not obtain a node name for corosync nodeid 1084777441
cib: info: crm_get_peer: Created entry a6ced2c1-9d51-445d-9411-2fb19deab861/0x55848365a150 for node (null)/1084777441 (1 total)
cib: info: crm_get_peer: Node 1084777441 has uuid 1084777441
cib: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1084777441] - corosync-cpg is now online
cib: notice: crm_update_peer_state_iter: Node (null) state is now member | nodeid=1084777441 previous=unknown source=crm_update_peer_proc
cib: info: init_cs_connection_once: Connection to 'corosync': established
cib: info: corosync_node_name: Unable to get node name for nodeid 1084777441
cib: notice: get_node_name: Defaulting to uname -n for the local corosync node name
cib: info: crm_get_peer: Node 1084777441 is now known as region-ctrl-2
cib: info: qb_ipcs_us_publish: server name: cib_ro
cib: info: qb_ipcs_us_publish: server name: cib_rw
cib: info: qb_ipcs_us_publish: server name: cib_shm
cib: info: pcmk_cpg_membership: Node 1084777441 joined group cib (counter=0.0, pid=0, unchecked for rivals)
After working with clusterlabs a bit, I was able to find a fix for this. The fix was to correct /etc/corosync/corosync.conf by adding transport: udpu to the totem directive and making sure all nodes are properly added in the nodelist directive. If referring to nodes by name only, one needs to make sure the names are properly resolvable, which is usually done in /etc/hosts. Once corosync.conf is fixed, restart the entire cluster. In my case, the following corosync.conf was the fixed version (bindnetaddr still varies per node):
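A sketch of that conf, assuming the only change from the version posted above is the added transport line:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    transport: udpu

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}

# logging and quorum sections unchanged from the version posted above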