I have a four node Percona XtraDB cluster. Each night, one node shuts down for a backup and subsequently rejoins the cluster. This has run fine for quite some time now.
Last night, the backup node refused to rejoin the cluster. The other three nodes are humming along nicely, but every attempt to rejoin the failed node is unsuccessful.
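(By "humming along nicely" I mean the three surviving nodes still report a healthy primary component; a quick sanity check on each of them looks roughly like this:)
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';          -- expect 3 while the backup node is out
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';        -- expect Primary
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';   -- expect Synced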
Luckily, we provision all of our nodes using Ansible, so we've got a repeatable recipe for blowing away a failed node and replacing it with a brand new, from scratch database node.
Unluckily, brand-new nodes fail to join as well. We haven't changed the Ansible scripts recently, and we have launched instances with them successfully since the last change. This is all running in a two-AZ VPC on Amazon AWS, if that makes any difference.
The new server's MySQL log says:
131226 04:44:50 mysqld_safe Starting mysqld daemon with databases from /data/mysql
131226 04:44:50 mysqld_safe Skipping wsrep-recover for empty datadir: /data/mysql
131226 04:44:50 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
131226 4:44:50 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
131226 4:44:50 [Note] WSREP: Read nil XID from storage engines, skipping position init
131226 4:44:50 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
131226 4:44:50 [Note] WSREP: wsrep_load(): Galera 3.2(r170) by Codership Oy <[email protected]> loaded successfully.
131226 4:44:50 [Note] WSREP: CRC-32C: using hardware acceleration.
131226 4:44:50 [Warning] WSREP: Could not open saved state file for reading: /data/mysql//grastate.dat
131226 4:44:50 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
131226 4:44:50 [Note] WSREP: Passing config to GCS: base_host = 10.0.1.226; base_port = 4567; cert.log_conflicts = no; gcache.dir = /data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
131226 4:44:50 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
131226 4:44:50 [Note] WSREP: wsrep_sst_grab()
131226 4:44:50 [Note] WSREP: Start replication
131226 4:44:50 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
131226 4:44:50 [Note] WSREP: protonet asio version 0
131226 4:44:50 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
131226 4:44:50 [Note] WSREP: backend: asio
131226 4:44:50 [Note] WSREP: GMCast version 0
131226 4:44:50 [Note] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
131226 4:44:50 [Note] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
131226 4:44:50 [Note] WSREP: EVS version 0
131226 4:44:50 [Note] WSREP: PC version 0
131226 4:44:50 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.0.2.180:,10.0.1.226:,10.0.1.21:,10.0.2.236:'
131226 4:44:50 [Warning] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') address 'tcp://10.0.1.226:4567' points to own listening address, blacklisting
131226 4:44:53 [Warning] WSREP: no nodes coming from prim view, prim not possible
131226 4:44:53 [Note] WSREP: view(view_id(NON_PRIM,7497477a-6de8-11e3-9a71-7af4d47d85be,1) memb {
7497477a-6de8-11e3-9a71-7af4d47d85be,0
} joined {
} left {
} partitioned {
})
131226 4:44:54 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.5049S), skipping check
131226 4:45:23 [Note] WSREP: view((empty))
131226 4:45:23 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
131226 4:45:23 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
131226 4:45:23 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'my_wsrep_cluster' at 'gcomm://10.0.2.180,10.0.1.226,10.0.1.21,10.0.2.236': -110 (Connection timed out)
131226 4:45:23 [ERROR] WSREP: gcs connect failed: Connection timed out
131226 4:45:23 [ERROR] WSREP: wsrep::connect() failed: 7
131226 4:45:23 [ERROR] Aborting
131226 4:45:23 [Note] WSREP: Service disconnected.
131226 4:45:24 [Note] WSREP: Some threads may fail to exit.
131226 4:45:24 [Note] /usr/sbin/mysqld: Shutdown complete
131226 04:45:24 mysqld_safe mysqld from pid file /data/mysql/mysql.pid ended
The existing nodes in the cluster spit the following out whenever the new node attempts to join:
131226 4:45:23 [Warning] WSREP: unserialize error invalid flags 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():101
Has anyone encountered this? Googling reveals little and it makes me anxious that our repeatable server config is not currently repeatable.
This turns out to be a side effect of the Galera 2.x -> 3.x switch.
The new server got the newer Galera 3.x packages. In a mixed cluster, it needs:
wsrep_provider_options="socket.checksum=1"
in /etc/my.cnf for the 3.x nodes so they can talk to the 2.x nodes. Many thanks to Alex Yurchenko (ayurchen) over on Launchpad.
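For context, here is a minimal sketch of the relevant part of /etc/my.cnf on a 3.x node. The data directory, provider path, cluster name, and node addresses are taken from the log above; treat everything else in your real config (SST settings, node names, and so on) as untouched.
[mysqld]
datadir               = /data/mysql
wsrep_provider        = /usr/lib64/libgalera_smm.so
wsrep_cluster_name    = my_wsrep_cluster
wsrep_cluster_address = gcomm://10.0.2.180,10.0.1.226,10.0.1.21,10.0.2.236
# Galera 3.x defaults to CRC-32C message checksums (visible in the log above);
# socket.checksum=1 falls back to the CRC-32 format that the Galera 2.x nodes understand.
wsrep_provider_options = "socket.checksum=1"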
The only other place on the Internet where this particular error is referenced is comment 9 of bug 1153727, which is marked as fixed. I would suggest upgrading the "functioning" nodes to the latest available version of Percona XtraDB Cluster; you should then be able to join new nodes running that same latest version.
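If you do upgrade, it is easy to confirm which Galera provider each node is actually running; something along these lines on each node will show the 2.x vs 3.x split and the effective checksum setting:
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';      -- e.g. 3.2(r170) on the new node, per the log above
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';   -- look for the socket.checksum value in the options string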