I have a primary-failover setup for HAProxy on two servers using Heartbeat. It has been running smooth for a while now. Today our service was down for a few minutes when the secondary server thought that the primary was down. It attempted to take over the shared IP, but it wasn't able to, as the primary server was still holding on to it. However, according to logs, the primary appears to have been communicating with the secondary regarding the takeover, so it doesn't make sense.
After resolving the issue by restarting Heartbeat on the primary, I noticed that the time on the primary was off by about 5 minutes compared to the secondary. Does Heartbeat use time to distinguish if a resource is down?
ServerNode1 syslog:
Jun 8 14:25:51 serverNode heartbeat: [15461]: ERROR: Both machines own our resources!
Jun 8 14:25:52 serverNode heartbeat: [15461]: ERROR: Both machines own our resources!
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: Received shutdown notice from 'serverNode2'.
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: Resources being acquired from serverNode2.
Jun 8 14:25:56 serverNode heartbeat: [15461]: debug: StartNextRemoteRscReq(): child count 1
Jun 8 14:25:56 serverNode heartbeat: [18058]: info: acquire local HA resources (standby).
Jun 8 14:25:56 serverNode ResourceManager[18087]: info: Acquiring resource group: serverNode xxx.xxx.xxx.88
Jun 8 14:25:56 serverNode IPaddr[18124]: INFO: Running OK
Jun 8 14:25:56 serverNode IPaddr[18138]: INFO: Running OK
Jun 8 14:25:56 serverNode heartbeat: [18059]: info: Local Resource acquisition completed.
Jun 8 14:25:56 serverNode heartbeat: [18058]: info: local HA resource acquisition completed (standby).
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: Standby resource acquisition done [foreign].
Jun 8 14:25:56 serverNode heartbeat: [15461]: debug: StartNextRemoteRscReq(): child count 1
Jun 8 14:25:56 serverNode heartbeat: [18184]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Jun 8 14:25:56 serverNode harc[18184]: info: Running /etc/ha.d//rc.d/status status
Jun 8 14:25:56 serverNode mach_down[18199]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Jun 8 14:25:56 serverNode mach_down[18199]: info: mach_down takeover complete for node serverNode2.
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: mach_down takeover complete.
ServerNode2 syslog:
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: WARN: node serverNode: is dead
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: WARN: No STONITH device configured.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: WARN: Shared disks are not protected.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: info: Resources being acquired from serverNode.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: info: Link serverNode:eth0 dead.
Jun 8 14:31:33 serverNode2 heartbeat: [30881]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Jun 8 14:31:33 serverNode2 harc[30881]: info: Running /etc/ha.d//rc.d/status status
Jun 8 14:31:33 serverNode2 heartbeat: [30882]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys serverNode2] to acquire.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: debug: StartNextRemoteRscReq(): child count 1
Jun 8 14:31:33 serverNode2 mach_down[30909]: info: Taking over resource group xxx.xxx.xxx.88
Jun 8 14:31:33 serverNode2 ResourceManager[30934]: info: Acquiring resource group: serverNode xxx.xxx.xxx.88
Jun 8 14:31:33 serverNode2 IPaddr[30961]: INFO: Resource is stopped
Jun 8 14:31:33 serverNode2 ResourceManager[30934]: info: Running /etc/ha.d/resource.d/IPaddr xxx.xxx.xxx.88 start
Jun 8 14:31:33 serverNode2 IPaddr[31019]: INFO: Using calculated nic for xxx.xxx.xxx.88: eth0
Jun 8 14:31:33 serverNode2 IPaddr[31019]: INFO: Using calculated netmask for xxx.xxx.xxx.88: 255.255.255.0
Jun 8 14:31:33 serverNode2 IPaddr[31019]: INFO: eval ifconfig eth0:0 xxx.xxx.xxx.88 netmask 255.255.255.0 broadcast xxx.xxx.xxx.255
Jun 8 14:31:33 serverNode2 IPaddr[31007]: INFO: Success
Jun 8 14:31:33 serverNode2 mach_down[30909]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: info: mach_down takeover complete.
Jun 8 14:31:33 serverNode2 mach_down[30909]: info: mach_down takeover complete for node serverNode.
No, the clocks being off will not break the relationship. However if the clocks changed dramatically it would cause errors in the log, they would look like:
But it would not kill the primary.
What it looks like is communication broke between the servers. Specifically it looks like either server1 was no longer able to send out data or server2 was not receiving properly. This could be due to some buffer problem. Are you tracking network buffer space? (via snmp or netstat) Or maybe a networking issue somewhere, switchport errors?
When you say the site was down, do you have monitoring testing the service on each serverX to an IP specific to that server? Did it indicate if either of the servers was down during that period in addition to the VIP not working? Do traffic graphs or error/drop counts show anything interesting for that time period?
Need more information.
My suspicion is that something is filtering traffic between your hearbeat nodes. Iptables is one possibility. Depending on your physical topology, other devices may also be suspect.