Using DRBD version: 8.2.6 (api:88/proto:86-88)
Here are the contents of /etc/ha.d/haresources:
db1 192.168.100.200/24/eth0 drbddisk::mysql Filesystem::/dev/drbd0::/drbd::ext3::defaults mysql
and /etc/ha.d/ha.cf:
logfile /var/log/ha-log
logfacility local0
keepalive 1
deadtime 30
warntime 10
initdead 120
udpport 694
bcast eth0, eth4
auto_failback off
node db1
node db2
respawn hacluster /usr/lib64/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster
deadping 5
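For reference, heartbeat reads that haresources line as one resource group, acquired left to right (this breakdown is my own annotation, not part of the config):

db1 - preferred node for the group
192.168.100.200/24/eth0 - the cluster IP, brought up on eth0
drbddisk::mysql - promote the DRBD resource "mysql" to Primary
Filesystem::/dev/drbd0::/drbd::ext3::defaults - mount /dev/drbd0 on /drbd as ext3
mysql - start the mysql init script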
When testing failover between machines I ran the following commands on db2:
service heartbeat stop
service mysqld stop
drbdadm down mysql
service drbd stop
/proc/drbd on db1 reported
0: cs:Connected st:Primary/Unknown ds:UpToDate/DUnknown C r---
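For comparison, a healthy connected pair on this DRBD version should report something like (hypothetical output, not captured from these machines):

0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---

so the Unknown/DUnknown fields simply mean db1 could no longer see its peer.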
What happened next, after:
- Bringing the services back online on db2
- Transferring primary to db2 using hb_primary script
- Taking db1 down as above
- Bringing the services back online on db1
- Transferring primary back to db1 using hb_primary script
was that db1 remounted the DRBD disk, assumed the correct IP and started MySQL. There was massive MySQL table corruption; it was all fixable (using InnoDB recovery mode 6, mysqlcheck and the occasional backup), but how did it happen?
I speculate:
- DRBD disconnected the disk from the filesystem while it was being used by MySQL, as a clean MySQL shutdown would not have resulted in corrupt data
- heartbeat controlled DRBD, and stopping the heartbeat service "pulled the plug" on DRBD (see the stop-order sketch after this list)
- this may happen again in the case of an actual failover (due to heartbeat ping timeout)
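For what it's worth, a clean "service heartbeat stop" is supposed to release the haresources entries right to left, so the stop sequence on the node giving up the resources would roughly be (a sketch of the documented behaviour, not a capture from this cluster):

stop the mysql init script
umount /drbd (the Filesystem resource)
drbdadm secondary mysql (via the drbddisk script)
release 192.168.100.200 from eth0

which only leaves the data in a consistent state if the mysql stop actually completed before the unmount and demote.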
I do not have access to this setup again for some time, and would like to repeat the test.
Are the configuration settings correct?
Was the corruption the result of my manual testing?
Is there a better way to test failover than to stop the heartbeat service and let it run the haresources commands?
This probably isn't a big help, but the topic has been discussed extensively of late over at the Pacemaker and Linux-HA mailing lists.
I'm not very good with heartbeat, but with Pacemaker I would set up a constraint that causes the cluster resource manager to take a write lock and flush the disks (or stop MySQL temporarily) before trying to switch over, and then release the lock once the switch has completed.
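For example, something along these lines in the crm shell (resource names are made up for illustration, and I'm assuming the ocf:linbit:drbd agent; older DRBD releases shipped ocf:heartbeat:drbd instead):

primitive drbd_mysql ocf:linbit:drbd params drbd_resource="mysql" op monitor interval="15s"
ms ms_drbd_mysql drbd_mysql meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
primitive fs_mysql ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/drbd" fstype="ext3"
primitive ip_mysql ocf:heartbeat:IPaddr2 params ip="192.168.100.200" cidr_netmask="24" nic="eth0"
primitive mysqld lsb:mysqld
group g_mysql fs_mysql ip_mysql mysqld
colocation mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order mysql_after_drbd inf: ms_drbd_mysql:promote g_mysql:start

The order and colocation constraints are what guarantee MySQL and the filesystem are stopped before DRBD is demoted on a switchover, and only started again once the other node has been promoted.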
From everything I've read, and my limited experience with heartbeat, all you should have to do to manually fail over from one server to another is issue the
command. Everything that is in your haresources file will be controlled by heartbeat. Case in point, I have a cluster I'm setting up that needs to run the following services:
Here is the haresources config
and here are the results I get (my apologies if it's a mess, I can't get the line breaks in the right spot):
Notice that stopping heartbeat stopped all the services that are assigned to heartbeat (mysqld, snmpd); also notice that drbd is still running and heartbeat did NOT stop it. DRBD needs to be running the whole time for failover to work.
Try your failover again, but don't run the drbd commands, and I think you'll avoid your data corruption.
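In other words, the manual test on the active node boils down to something like this (an annotated restatement, using the service names from the question):

service heartbeat stop    # heartbeat stops mysql, unmounts /drbd, demotes DRBD to Secondary and drops the cluster IP
cat /proc/drbd            # drbd itself is still running, just Secondary now
# check that db2 has picked everything up, then:
service heartbeat start   # with auto_failback off the resources stay on db2 until you fail back by hand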
The way to test heartbeat is to issue "service heartbeat stop" on one machine; it fails over to the other machine and automatically brings up all the services on the other node. Also, you do not want to turn off the drbd service.
The other way to test is to do a hard reboot on one machine.
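If you want to trigger that hard reboot from the shell rather than pulling the power (my addition, not part of the answer above), the magic SysRq interface reboots immediately with no sync and no clean unmount:

echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger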