I am following the http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_verify_corosync_installation.html document to set up a two-node cluster in AWS. Both nodes have Pacemaker installed and the firewall rules are enabled. When I run the pcs status command on both nodes, each node reports the other as UNCLEAN (offline).
The two nodes that I have set up are ha1p and ha2p.
OUTPUT ON ha1p
[root@ha1 log]# pcs status
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Wed Dec 24 21:30:44 2014
Last change: Wed Dec 24 21:27:44 2014
Stack: cman
Current DC: ha1p - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
0 Resources configured
Node ha2p: UNCLEAN (offline)
Online: [ ha1p ]
Full list of resources:
OUTPUT ON ha2p
[root@ha2 log]# pcs status
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Wed Dec 24 21:30:44 2014
Last change: Wed Dec 24 21:27:44 2014
Stack: cman
Current DC: ha2p - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
0 Resources configured
Node ha1p: UNCLEAN (offline)
Online: [ ha2p ]
Full list of resources:
The contents of /etc/cluster/cluster.conf are as below:
[root@ha1 log]# cat /etc/cluster/cluster.conf
<cluster config_version="9" name="mycluster">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="ha1p" nodeid="1">
      <fence>
        <method name="pcmk-method">
          <device name="pcmk-redirect" port="ha1p"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="ha2p" nodeid="2">
      <fence>
        <method name="pcmk-method">
          <device name="pcmk-redirect" port="ha2p"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk-redirect"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
Any help would be much appreciated.
Yes, you need to make sure the hostname you are using in your cluster definition is NOT the hostname on the 127.0.0.1 line in /etc/hosts. So, in my /etc/hosts each node name points to that node's real cluster IP rather than to the loopback address.
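For example, a minimal /etc/hosts along these lines (the 10.0.0.x addresses are placeholders; use the instances' real private IPs):

    127.0.0.1   localhost localhost.localdomain
    10.0.0.11   ha1p
    10.0.0.12   ha2p

The point is that ha1p and ha2p resolve to addresses corosync can actually bind to and reach, and that neither name appears on the 127.0.0.1 line.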
This happens because your cluster doesn't have a full STONITH configuration. The UNCLEAN state means the cluster doesn't know the actual state of that node.
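If you just want to get the test cluster up while working through the guide, STONITH can be switched off for now (this is what the WARNING at the top of your pcs status output refers to); a production cluster should have real fencing configured instead:

    pcs property set stonith-enabled=false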
Maybe you can edit the /etc/hosts file and remove the lines that contain 127.0.0.1 and ::1 (the lines that mention localhost). I had this exact problem, tried this method, and it solved the issue.
The error means that corosync could not contact the corosync service running on the other cluster node(s).
How to fix:
1. Check whether corosync is listening, and on which address, with:
   ss -tulnp|egrep ':5405.*corosync'
2. Check the ip_version setting (e.g. ip_version: ipv6) in the totem section of the /etc/corosync/corosync.conf file; it needs to match the address family the node names resolve to (a sketch of that section follows the list).
3. Run getent ahosts $HOSTNAME to see how the current host name is resolved.
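For reference, a rough sketch of what the relevant parts of /etc/corosync/corosync.conf could look like on a corosync 2.x stack (the node names and cluster name are taken from the question; transport udpu is an assumption that fits EC2, where multicast is not available):

    totem {
        version: 2
        cluster_name: mycluster
        transport: udpu          # unicast UDP; EC2 has no multicast
        ip_version: ipv4         # or ipv6, matching how the node names resolve
    }

    nodelist {
        node {
            ring0_addr: ha1p
            nodeid: 1
        }
        node {
            ring0_addr: ha2p
            nodeid: 2
        }
    }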
I got the same issue and the reason was a time mismatch between the two nodes. After syncing them with ntp/chrony, the issue was resolved.
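A quick way to verify this (the commands assume chrony; ntpd serves the same purpose on older installs):

    date                  # run on both nodes and compare the clocks
    chronyc tracking      # shows whether the local clock is synchronized
    chronyc sources -v    # shows which time sources are reachable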