Question

Julien

Asked: 2014-11-19 21:13:26 +0800 CST

Broken RabbitMQ cluster won't restart

I run RabbitMQ on 3 servers, all with the same versions: RabbitMQ 3.4.1 and Erlang 17.3.

One node crashed on server 2. The other two nodes did not connect to each other:

server 1:

[CentOS-62-64-minimal ~]$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@CentOS-62-64-minimal' ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,['rabbit@CentOS-62-64-minimal']},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

server 3:

[de3 ~]$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@de3 ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,[rabbit@de3]},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

After restarting and resetting RabbitMQ on server 3, it finally connected to server 1:

[CentOS-62-64-minimal ~]$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@CentOS-62-64-minimal' ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,['rabbit@CentOS-62-64-minimal']},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

Why did the cluster "break" with just 1 node down? Server 3 was working fine, but server 1 was not; it reported "Queue is located on a server that is down".

As for server 2, it did not restart. After a manual restart, I cannot make it reconnect to the cluster, even after multiple resets and removing /var/lib/rabbitmq/mnesia/:

[root@mysql ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@mysql ...
[{nodes,[{disc,[rabbit@mysql]}]},
 {running_nodes,[rabbit@mysql]},
 {cluster_name,<<"[email protected]">>},
 {partitions,[]}]

[mysql ~]# rabbitmqctl stop_app
Stopping node rabbit@mysql ...
[root@mysql ~]# rabbitmqctl force_reset
Forcefully resetting node rabbit@mysql ...
[mysql ~]# rabbitmqctl join_cluster rabbit@CentOS-62-64-minimal
Clustering node rabbit@mysql with 'rabbit@CentOS-62-64-minimal' ...
Error: {ok,already_member}
[mysql ~]# rabbitmqctl start_app
Starting node rabbit@mysql ...
[mysql ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@mysql ...
[{nodes,[{disc,[rabbit@mysql]}]},
 {running_nodes,[rabbit@mysql]},
 {cluster_name,<<"[email protected]">>},
 {partitions,[]}]

I have no idea what went wrong. Last time this happened, I upgraded RabbitMQ and Erlang to the latest version.

3 Answers

bodgit · Answer 1 · 2014-11-22T06:18:22+08:00

Based on the RabbitMQ cluster documentation, your rabbitmqctl cluster_status output looks wrong: running_nodes should contain more than just the local node where you are running the command. That suggests to me that the nodes can't talk to each other properly. Are there any firewalls between the nodes?
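One way to test the firewall theory is to probe the ports the nodes use to find and talk to each other. The sketch below assumes the default ports for this RabbitMQ generation: 4369 for epmd and 25672 for inter-node traffic. Adjust them if the distribution port has been changed in your configuration. The function names are made up for illustration, and port_open relies on bash's /dev/tcp and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# port_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened within 3 seconds.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# check_cluster_ports HOST...
# Reports, for each host, whether epmd (4369) and the assumed default
# inter-node port (25672) accept connections from this machine.
check_cluster_ports() {
    local host port
    for host in "$@"; do
        for port in 4369 25672; do
            if port_open "$host" "$port"; then
                echo "$host:$port open"
            else
                echo "$host:$port unreachable"
            fi
        done
    done
}
```

Running check_cluster_ports CentOS-62-64-minimal de3 mysql from each of the three servers should show every port as open if nothing is filtering traffic. An "unreachable" line only says the connection failed from that machine; it does not say whether a firewall or a stopped service is the cause.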

Craig · Answer 2 · 2014-11-25T11:35:14+08:00

Bodgit is correct. I can tell you from having an operational rabbit cluster that your configuration is wrong. It looks like each node is its own cluster with only itself as the current running node.

Please refer back to the RabbitMQ doc on setting up the cluster.

You should see something much more like the following on each node:

    root@rabbit0:~# rabbitmqctl cluster_status
    Cluster status of node 'rabbit@rabbit0' ...
    [{nodes,[{disc,['rabbit@rabbit0','rabbit@rabbit1']}]},
     {running_nodes,['rabbit@rabbit1','rabbit@rabbit0']},
     {cluster_name,<<"[email protected]">>},
     {partitions,[]}]
    ...done.

    root@rabbit1:~# rabbitmqctl cluster_status
    Cluster status of node 'rabbit@rabbit1' ...
    [{nodes,[{disc,['rabbit@rabbit0','rabbit@rabbit1']}]},
     {running_nodes,['rabbit@rabbit0','rabbit@rabbit1']},
     {cluster_name,<<"[email protected]">>},
     {partitions,[]}]
    ...done.

This is sanitized, but the order and intent are kept.
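To compare nodes quickly, the running_nodes list can be pulled out of that output with standard text tools. This is only a sketch: it assumes the Erlang-term output format shown in this thread (other RabbitMQ versions may print cluster_status differently), and both function names are invented for the example.

```shell
# running_nodes
# Reads `rabbitmqctl cluster_status` text on stdin and prints one running
# node name per line, with quotes and spaces stripped.
running_nodes() {
    tr -d '\n' |
        sed -n 's/.*{running_nodes,\[\([^]]*\)]}.*/\1/p' |
        tr ',' '\n' |
        tr -d "' "
}

# running_node_count
# Same input as above; prints how many nodes are listed as running.
running_node_count() {
    running_nodes | grep -c .
}
```

For example, sudo rabbitmqctl cluster_status | running_node_count printing 1 on every server would match the symptom in the question, where each node only sees itself.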

You also need to configure high availability if you want your queues to fail over:

https://www.rabbitmq.com/ha.html
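For the RabbitMQ 3.x series discussed here, queue mirroring is switched on through a policy. A minimal sketch, assuming you want every queue mirrored to every node: the policy name ha-all is an arbitrary label, the pattern "^" matches all queue names, and the wrapper function name is invented for the example. Mirroring everything has a performance cost, so check the linked page before applying it as-is.

```shell
# mirror_all_queues
# Applies a policy named "ha-all" that mirrors every queue ("^" matches
# all queue names) to all nodes in the cluster.
mirror_all_queues() {
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
}
```

Run mirror_all_queues once on any node that is a working member of the cluster. Without a policy like this, a queue lives only on the node where it was declared, which would explain the "Queue is located on a server that is down" message in the question.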

joshua · Answer 3 · 2017-04-21T13:52:52+08:00

Best Answer

I had this issue today while designing an intentional-break document for a break/fix event to teach our operations team how to fix stuff. I intentionally unclustered a node and was unable to run rabbitmqctl join_cluster successfully because the cluster believed the node to already be a member.

Clustering node 'rabbit@node-1' with 'rabbit@node-0' ... ...done (already_member).

Ultimately, what worked for me was running rabbitmqctl forget_cluster_node rabbit@node-1 from a working clustered node. Once I did that, I was able to successfully run rabbitmqctl join_cluster rabbit@node-0.
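Because the two halves of that fix run on different machines, it can help to write the order down. The sketch below only prints the commands and executes nothing. The function name is made up, and the stop_app and start_app steps around join_cluster are taken from the sequence in the question, not from this answer, so treat the whole plan as an assumption to verify on a test cluster first.

```shell
# rejoin_plan STALE_NODE SEED_NODE
# Prints which rabbitmqctl commands to run, and on which side, to make the
# cluster forget STALE_NODE and then join it again via SEED_NODE.
rejoin_plan() {
    local stale=$1 seed=$2
    echo "# on a healthy cluster member (e.g. the host of $seed):"
    echo "rabbitmqctl forget_cluster_node $stale"
    echo "# on the host of $stale:"
    echo "rabbitmqctl stop_app"
    echo "rabbitmqctl join_cluster $seed"
    echo "rabbitmqctl start_app"
}
```

For the cluster in the question this would be rejoin_plan rabbit@mysql rabbit@CentOS-62-64-minimal. If the rejoining node still carries old state, a reset after stop_app (as the asker already tried) may also be needed.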
