The environment consists of two Windows Server 2012 R2 virtual machines running RabbitMQ with high availability (ha-all) on their queues. I use Veeam to create snapshot backups that are sent offsite as part of the DR policy.
What I am seeing are intermittent failures of the cluster when the Veeam backup runs. When the cluster breaks, Mnesia events are logged, and sometimes one node shuts itself down completely. I believe the cause is the way Veeam "blips" the VM: it essentially pauses the VM for a brief moment and then resumes it. When this blip occurs, both nodes see the other vanish and the secondary promotes itself to master immediately. With two masters running, as soon as they see each other again (literally seconds later) they butt heads and the cluster breaks.
I read about net_ticktime here and set it to 300 seconds, thinking this would make the cluster more resilient to the short Veeam blips, but it doesn't appear to have helped. When one node sees the other vanish, the secondary promotes itself to master immediately and does not seem to use the net_ticktime setting.
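For context, here is a rough sketch of the timing I expected from that setting. It assumes the behaviour described in the Erlang kernel documentation: ticks are sent every net_ticktime / 4 seconds, and an unresponsive node is detected somewhere between 0.75 and 1.25 times net_ticktime. The function name `tick_window` is only illustrative and is not part of RabbitMQ or Erlang.

```python
def tick_window(net_ticktime_s):
    """Return (tick interval, earliest detection, latest detection) in seconds.

    Assumption: follows the documented Erlang kernel rule that a silent
    node is detected between T - T/4 and T + T/4, with ticks every T/4.
    """
    tick_interval = net_ticktime_s / 4
    earliest = net_ticktime_s - tick_interval
    latest = net_ticktime_s + tick_interval
    return tick_interval, earliest, latest


if __name__ == "__main__":
    interval, earliest, latest = tick_window(300)
    print(f"tick every {interval:.0f}s, node-down detected after {earliest:.0f}s to {latest:.0f}s")
```

If the 300-second value were actually in effect, a pause of a few seconds should fall far short of the 225-second lower bound, which is why the immediate promotion makes me suspect that either the setting is not being applied or the failure is being detected by some other mechanism.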
Example Mnesia error:
Mnesia('rabbit@Node01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@Node02'}
Has anyone else experienced this or something similar? Are there additional configuration settings in RabbitMQ or Erlang that might help make the cluster more resilient to brief losses of connectivity between the nodes?