How do I perform crash recovery in a MySQL/InnoDB Master-Slave asynchronous replication setup?
Specifically:
If a slave crashes, how do I make it syncs up with the master after I bring it back up?
If the master crashes, a slave will become the master. How do I make the new master syncs up with other slaves? And when the original master is brought back up, how do I sync it with the new master?
Because replication is async, a transaction that has been committed to the master may not be able to leave the master before the crash happens. So there could be inconsistency between the original master and slaves, one of which will become the new master.
Likewise, the slave that are promoted to become the new master may not have the most up-to-date transactions among all the slaves. So the new master could be "behind" one of its slaves.
How do I resolve all these potential inconsistencies?
Any tools that help with these tasks?
Thanks.
I would recommend avoiding MMM like the plague. It is very risky software and causes much more downtime than it prevents. I have extensive experience with it and my company tried to fix its problems, but it's unfixable. I'm not sure if it's appropriate to post a link to my blog post where I explain why this is true. The original author of MMM agrees, by the way, that it's a disaster.
I give it a shot:
If it looks like this, everything is fine. If the Seconds_Behind_Master is > 0, the replica is catching up. If the Slave_IO_Running is not running, you have an uncommon problem, check the error logs. If Slave_SQL_Running is not running, try to start it with "START SLAVE;". If that fails, check if there is an error mentioned in the "Last_Error" line.
To resync the old master, just add it as slave and let the replication finish. Then you can take the system offline and switch back to the old master.
Transactions are a problem. Especially if you use a transaction agnostic backend like MyISAM. Using InnoDB should work. AFAIK only completed transaction are written to the binlog and thus to the replicas. This will only apply if the database is aware of your transactions.
With the commands given earlier you can check the state of all your slaves and promote the freshest (hint: Log_Pos) slave to master. maybe just temporarily, until all slaves are fresh again and then promote the designated server.
Personally i think you would need a special setup (e.g. mixing WAN and LAN slaves, huge transaction queries) to have slaves with different relay-states after the master crashed.
The main tool I'd recommend is mmm. It handles circular replication, multiple slaves, failover and automatic promotion to master (and associated repointing of slaves), all transparent to clients via managed floating IPs, and it works beautifully (I had a primary DB server disappear just last night due to a dead switch, and my clients didn't even notice).
In conjunction with mmm I'd recommend xtrabackup as it can be used as a really fast and elegant way of setting up new slaves (perhaps to replace a machine that died), much faster than loading from an sql dump.
Also if you're zipping your backups, you NEED pigz - it will knock ~80% off your backup time!