I have a two-node Pacemaker cluster that no longer works after an upgrade. Package versions are pacemaker 1.1.16-1~bpo8+ and corosync 2.4.2-3~bpo8+1 under Debian Jessie.
Pacemaker is still able to start on one node. crm_node -l then lists that node as online and the second one as lost.
Pacemaker can no longer start on the second node. The following log messages in /var/log/corosync/logfile seem pertinent:
cib: info: validate_with_relaxng: Creating RNG parser context
pacemakerd: error: pcmk_child_exit: The cib process (1234) exited: Key has expired (127)
pacemakerd: notice: pcmk_process_exit: Respawning failed child process: cib
...
cib: info: validate_with_relaxng: Creating RNG parser context
pacemakerd: error: pcmk_child_exit: The cib process (1235) exited: Key has expired (127)
pacemakerd: notice: pcmk_process_exit: Respawning failed child process: cib
...
crmd: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
...
crmd: warning: do_cib_control: Couldn't complete CIB registration 16 times... pause and retry
crmd: notice: crm_shutdown: Shutting down cluster resource manager | limit=1200000ms
pacemakerd: notice: pcmk_shutdown_worker: Shutdown complete
So it appears that the second node attempts CIB registration and aborts the Pacemaker start after 16 failed attempts, and that the first node considers the second one dead, perhaps because it cannot register.
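To see at a glance which child process keeps dying and with what status, the pcmk_child_exit lines can be tallied. The following is only a rough sketch: the function name summarize_child_exits is made up for illustration, and the regular expression assumes the message wording shown in the excerpt above, which may differ in other Pacemaker versions.

```python
import re
from collections import Counter

# Pattern modelled on the pcmk_child_exit lines quoted above (an assumption:
# other Pacemaker versions may word this message differently).
CHILD_EXIT = re.compile(
    r"pcmk_child_exit:\s+The (?P<child>\S+) process \((?P<pid>\d+)\) "
    r"exited: (?P<reason>.*?) \((?P<code>\d+)\)"
)


def summarize_child_exits(log_text):
    """Count child exits per (child name, exit status, reason)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHILD_EXIT.search(line)
        if match:
            key = (match.group("child"), int(match.group("code")), match.group("reason"))
            counts[key] += 1
    return counts


# Two lines copied from the log excerpt above.
sample = """\
pacemakerd: error: pcmk_child_exit: The cib process (1234) exited: Key has expired (127)
pacemakerd: error: pcmk_child_exit: The cib process (1235) exited: Key has expired (127)
"""

for (child, code, reason), n in summarize_child_exits(sample).most_common():
    print(f"{child}: exited {n}x with status {code} ({reason})")
# prints: cib: exited 2x with status 127 (Key has expired)
```

On a real node, the contents of /var/log/corosync/logfile would be passed in place of the sample string.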
How can one get out of a situation like this?
The root cause turned out to be an outdated version of the package libpe-rules2, which provides libpe-rules2.so. The pacemaker package from jessie-backports requires only >= 1.0.10 (perhaps a bug in the current package description), but the current version of libpe-rules2 (also from jessie-backports) is 1.1.16.

The older version of the library made the cib process fail because of undefined symbols in the dynamic library. This was revealed by starting pacemakerd (and, in effect, cib) with strace -f. Upgrading with apt-get install libpe-rules2 resolved the situation.
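As an alternative to reading strace output (this is not what I did above, just another way to look for the same symptom), ldd -r asks the dynamic linker to resolve all symbols of a binary and reports the ones it cannot find. Below is a minimal sketch that wraps this. The helper names are hypothetical, the parser only handles the plain "undefined symbol: NAME" form of the message, and the path /usr/lib/pacemaker/cib in the usage note is an assumption about where the Debian package installs the daemon.

```python
import re
import subprocess

# ldd -r reports unresolved symbols as "undefined symbol: NAME (FILE)".
# Versioned-symbol messages use a different wording and are not handled here.
UNDEFINED = re.compile(r"undefined symbol:\s*(\S+)")


def undefined_symbols(ldd_output):
    """Return the unresolved symbol names found in ldd -r output, in order."""
    found = []
    for match in UNDEFINED.finditer(ldd_output):
        name = match.group(1)
        if name not in found:
            found.append(name)
    return found


def check_binary(path):
    """Run ldd -r on a binary and return its unresolved symbols."""
    result = subprocess.run(
        ["ldd", "-r", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Scan both streams, since I am not sure which one carries the messages.
    return undefined_symbols(result.stdout + result.stderr)
```

Calling check_binary("/usr/lib/pacemaker/cib") on the broken node should then return a non-empty list if a library such as libpe-rules2.so is too old, and an empty list once the package has been upgraded.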