We run 10 servers, mostly cheap requisitioned desktop machines, all running CentOS 5.1 and Xen. All these 10 servers do is run Xen virtual machines. Someone switched off the power supply to the server cupboard, and a couple of machines now have fried hard disks and will no longer boot. No worries: all the virtual machine disk images have DRBD-mirrored copies on different servers, so I just need to start those up while I work on getting those two machines back up.
I've replaced the boot drive on each of them and re-installed CentOS, Xen and DRBD. However, after a reboot, the servers can all see their local DRBD devices just fine but report a status of "WFConnection" (waiting for connection) for the remote link. Each device reports as "Secondary/Unknown". I've checked:
- Each server has iptables configured correctly to let DRBD traffic through - this is definite, the iptables config file is from a central repository and is identical to what it was before the machines crashed.
- It's not a DNS problem, as each server has a fixed IP address and DRBD.conf uses those IP addresses directly, so DNS isn't even used. I've made sure the new installs use the same fixed IP addresses and hostnames as the original servers.
- Each server can ping the other one on all IP addresses used, no problem there. The servers are all connected to the same switch.
Does anyone know why DRBD is still refusing to connect?
If you have a "split-brain" situation, you can resolve it with the following commands. On the node whose data you want to discard:
drbdadm secondary all; drbdadm -- --discard-my-data connect all
On the node whose data you want to keep:
drbdadm primary all; drbdadm connect all
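A dry-run sketch of that sequence, wrapped as one function per node. It assumes the DRBD 8.x command syntax used in this thread. The function names, the DRY_RUN switch and the resource name r0 are hypothetical, added only for illustration.

```shell
#!/bin/sh
# Sketch: manual split-brain recovery for a single DRBD resource.
# DRY_RUN defaults to 1, so commands are only printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on the node whose local changes are to be thrown away.
discard_local_data() {
    res="$1"
    run drbdadm secondary "$res"
    run drbdadm -- --discard-my-data connect "$res"
}

# Run on the node whose data is to be kept.
keep_local_data() {
    res="$1"
    run drbdadm primary "$res"
    run drbdadm connect "$res"
}
```

Sourcing the file and calling `discard_local_data r0` on one node and `keep_local_data r0` on the other prints the commands that would be run. Set DRY_RUN=0 only once you are sure which node holds the good copy, because the discarded side is overwritten by the resync.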
I have configured DRBD to notify me if a split-brain condition occurs. You can do this in /etc/drbd.conf by adding a split-brain handler to the handlers section.
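A handlers entry of that kind might look like the fragment below. This is a sketch under assumptions, not the poster's actual config: the resource name r0 is hypothetical, and the notify-split-brain.sh helper and its path are assumed to match what your DRBD package installs, so check before relying on it.

```
resource r0 {
  handlers {
    # Assumed helper script: mails root when a split brain is detected.
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
  # rest of the resource definition unchanged
}
```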

You will have to manually make the partition primary:
drbdadm primary all
Then you will need to mount the partition.
Do you use heartbeat? If so, heartbeat should take care of everything for you. Just run:
/usr/lib/heartbeat/hb_takeover

Okay, I've found the answer. As Brent pointed out, a DRBD device needs to be primary before you can run a Xen virtual machine from it. But I hadn't even got to that stage yet: the two DRBD devices on different machines were refusing to acknowledge that the other existed. I finally thought to check through /var/log/messages after a reboot and saw the line "Split-Brain detected, dropping connection!", which gave me something to Google for and turned up these instructions:
http://www.drbd.org/users-guide/s-resolve-split-brain.html
This turned out to be the solution - DRBD was unable to tell which device on which machine was the up-to-date one, so you have to manually tell it which one to use. It'd be nice if DRBD could report the problem as a status ("SPTBrain" instead of simply "WFConnection", maybe) as /var/log/messages gets a whole bunch of stuff in it and I missed the error message the first few times I looked.
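Since the message is easy to miss in a busy log, a small check like the one below can pull it out directly. It is a minimal sketch: the function names are hypothetical, and it assumes the log lives in /var/log/messages and that /proc/drbd reports the connection state in a cs: field, as it does on DRBD 8.x.

```shell
#!/bin/sh
# Sketch: look for split-brain evidence in the syslog and in DRBD's status file.

# Print log lines that mention split brain (case-insensitive).
# Exits non-zero when nothing matches.
find_split_brain() {
    logfile="${1:-/var/log/messages}"
    grep -i 'split-brain' "$logfile"
}

# Print the connection state (the cs: field) of every device listed
# in a /proc/drbd-style status file, one state per line.
connection_states() {
    statusfile="${1:-/proc/drbd}"
    sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p' "$statusfile"
}
```

If `connection_states` shows a device stuck in WFConnection while `find_split_brain` returns a match, the link is most likely being dropped because of split brain rather than a network or firewall problem.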