I'd like some feedback about this failover strategy for a couple of MySQL servers I'm speccing up for a cluster, and I want to check if there's something obvious I'm not missing here.
One app server that connects to a mysql master server in day to day operations, and that has a mysql-server set up as a slave to it for master-slave replication purposes.
If the mysql server fails, I want to the web app to try connecting to the master, and then after n
failed attempts, perform the following:
- assume the master will no longer be available
- send a signal to the slave server to stop replication
- send a signal to the slave server to tell it to act as the new mysql master
- begin connecting to the server again, and treat it like the master from now on
Once the app is up again, and serving users, I'd like to be able to spin up a new slave server in the background, once it's ready to serve requests, set up master slave replication once again to provide the same failover support as before.
I'm pretty sure this has been done before, but I can't see any guides on this, so I'm assuming there must be some obvious reason you wouldn't try this, that I haven't thought of yet.
What are the pitfalls of taking this approach for providing automated failover like this with MySQL?
As an aside, I'm aware of master-master replication, but a) I've seen it go horribly wrong, and it b) seems worryingly over-complicated.
Thanks
What you want is a High Availability cluster and I think that your suggested approach seems a bit strange.
A good way to achieve this is creating a Linux HA cluster and sync your MySQL using DRDB sync on filesystem level.
In such a setup you have 3 things:
Instead of making a lot of code in your application you use a virtual IP address that you move around to the current active node. Also you use STONITH (Shoot The Other Node In The Head (I did not make this up)) to make sure that the first node is actually dead before trying to take over the resources.
There's some great material to read on these links: http://www.linux-ha.org/wiki/Main_Page http://www.clusterlabs.org/wiki/DRBD_MySQL_HowTo http://theclusterguy.clusterlabs.org/
The reason automatic failover is not conducive has to do with replication lag. If the slave happens to be behind and failover occurs, you may be writing updates with keys that do not exist yet because the inserts from the master has not been written yet. The more replication lag, the more this is a problem. At my company we use DRBD for automatic failover since the DRBD server you failover to is an exact disk level copy of the original master. As a policy, we do manual for failover of master/slave and master/master setups.
I wouldn't rule out master-master replication. In fact what you describe is almost master-master replication.
Have a look at MMM (Multi-Master Replication Manager for MySQL). http://mysql-mmm.org/ It works at the mysql layer so it works much better than OS-based clustering.