It seems that my Galera cluster deadlocks every time I run a mysqldump or a table optimize.
I've run "mysqlcheck" and "mysqldump" several times on my MariaDB 10.1 database server, which runs in a Galera cluster with two other servers.
I've noticed that both tasks stop and don't show any progress after running for a short time.
For example, the mysqldump stops after creating a 14.0 MB (14,760,912 bytes) dump file and doesn't proceed.
The mysqlcheck to repair and optimize tables also hangs.
In both situations the cluster starts to have issues, and the only way to get it working normally again is to take the server that executed the job offline and then take a second server offline as well. I then bring them back online one by one, and the cluster works normally again.
I'm not sure what is causing these problems. I haven't found any errors in the syslog, although during the shutdown of the servers I noticed the following:
Jan 10 20:43:46 france mysqld[1015]: 2016-01-10 20:43:46 140096330258176 [Warning] WSREP: TO isolation failed for: 3, schema: mysql, sql: OPTIMIZE TABLE proc
. Check wsrep connection state and retry the query.
Jan 10 21:58:47 france mysqld[1034]: 2016-01-10 21:58:47 139691511322368 [Warning] WSREP: TO isolation failed for: 3, schema: smf, sql: OPTIMIZE TABLE smf_categories
. Check wsrep connection state and retry the query.
Jan 10 21:58:47 france mysqld[1034]: 2016-01-10 21:58:47 139691511322368 [Warning] Aborted connection 24 to db: 'smf' user: 'maintenance' host: 'localhost' (Unknown error)
Jan 10 21:58:47 france mysqld[1034]: 2016-01-10 21:58:47 139691509827328 [Warning] WSREP: TO isolation failed for: 3, schema: (null), sql: SELECT 1 FROM mysql.user LIMIT 1. Check wsrep connection state and retry the query.
I've found the issue to be with Galera. When I take the server out of the cluster, the optimize and dump jobs run much quicker and complete correctly.
I was pointed in the right direction by Tom on the MariaDB KB: https://mariadb.com/kb/en/mariadb/galera-cluster-fail-during-dump-or-optimize/#comment_1911
The issue appears to be caused by flow control. I've resolved it by tuning the flow control settings.
I did this by adding the following wsrep_provider_options: gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0
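For reference, a minimal sketch of how those options might look in the server configuration. The file path and the `[galera]` section name are assumptions (MariaDB 10.1 also reads wsrep settings from `[mysqld]`), and if `wsrep_provider_options` is already set on the server, these entries would need to be merged into the existing string rather than added as a second line:

```ini
# Assumed location: /etc/mysql/conf.d/galera.cnf (hypothetical path, adjust to your setup)
[galera]
# gcs.fc_limit:        receive queue length at which the node triggers flow control
# gcs.fc_factor:       fraction of fc_limit the queue must drop below before replication resumes
# gcs.fc_master_slave: YES tells Galera that only one node in the cluster takes writes
wsrep_provider_options="gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
```

Whether flow control is actually kicking in can be checked by running `SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';` on each node. A `wsrep_flow_control_paused` value close to 1 would suggest the cluster is spending most of its time paused.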
However, this change also created a new problem: the job now finishes correctly, but afterwards the cluster still goes down.