Background: We need an HA server in a small office environment and are looking at DRBD to provide it. We only have about 100GB that needs to be on the HA server, and server load will be extremely low. The data will probably grow about 10%-25% per year if we archive older office data, and 50%-75% per year if we don't.
The point is that we use a mix of consumer-grade and used enterprise-grade hardware, which WILL be a problem if we don't plan for it preemptively; and even pre-built quality servers DO fail, so redundant servers seem like the way to go.
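To put rough numbers on that growth, here is a minimal Python sketch that compounds the 100GB starting point at the rates mentioned above. The five-year horizon and the function name project_size_gb are assumptions chosen for illustration; nothing here is measured.

```python
def project_size_gb(start_gb, annual_growth, years):
    """Compound start_gb by annual_growth (0.25 means 25%) for the given years."""
    return start_gb * (1 + annual_growth) ** years


if __name__ == "__main__":
    START_GB = 100  # current data set, from the post
    YEARS = 5       # assumed planning horizon, not from the post

    scenarios = [
        ("archiving, low", 0.10),
        ("archiving, high", 0.25),
        ("no archiving, low", 0.50),
        ("no archiving, high", 0.75),
    ]
    for label, rate in scenarios:
        size = project_size_gb(START_GB, rate, YEARS)
        print(f"{label:>20}: {size:7.0f} GB after {YEARS} years")
```

Under these assumptions even the 75% case stays below 2TB after five years, so raw drive capacity is unlikely to be the limiting factor.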
The Plan: We are thinking it would be good to find (2) of the best bang-for-our-buck used servers and synchronize them. We simply need SATA/SAS capable servers and space for as many drives as can be had for the price. These servers seem like they can be had for $100-$200 (+some parts and additional drives) if you catch a deal.
This would theoretically mean a server could fail and, even if we took days to get to it, things would still hum along until our IT department (me) could deal with it, as long as we didn't have another coincidental failure. We would use Debian as the OS.
Some Questions
(A) How does DRBD handle drive or controller failure? This shows DRBD before the storage driver, so what happens when the controller fails and writes dirty data, or the drive fails but doesn't crash immediately? Is the data mirrored to the other server or not, and is there a risk of data corruption across servers in cases like these?
(B) What are the fail points for DRBD? That is, theoretically, as long as one server is up and running there are no issues EVER. But we know that there are issues, so what are the failure modes when using DRBD, since most of them should theoretically be software?
If we are going to have two servers for this, would it be reasonable to run VMs on each with MySQL and Apache for database and web server replication? (I am assuming so.)
Is DRBD reliable enough? If not, is the unreliability isolated to certain tasks, or is it more random? Searching turned up people with various issues, but this IS the internet, with seemingly more bad info than good.
If data is being synchronized over the LAN, does DRBD use double the bandwidth? That is, should we double up on NICs and do some link aggregation and trunking? Then maybe put them on separate routers on separate circuits and UPSs in separate rooms, and now you really have some redundancy!
Is this too crazy for an office in terms of server management? Is there a simpler REALTIME alternative (granted, DRBD seems simple in theory)?
We already have a server. So it seems to me a second USED server with a dedicated drive for DRBD could easily be had for around $150-$250 with some smart shopping. Add a second router, more drives, more NICs (used), and (2) UPSs and we're talking $1,000 +/-. That is relatively cheap! And I am hoping this would mainly buy us time during a server fault. Drive failures seem like the easier thing to handle with RAID these days. It's other hardware failures, like controllers, memory, or power supplies, that might require downtime to diagnose and fix that are the concern.
Redundant servers for us mean used hardware becomes more viable, with more uptime and more flexibility for me to fix things when my schedule allows, instead of having to stop everything to repair the server.
Hopefully I didn't miss that these questions have easy searchable answers. I did a quick search and didn't find what I was looking for.
First, you need to define what you really mean by "HA". What are you protecting against, and what are the costs of an outage of type X and duration Y? How will it affect your organization? What is your role in this organization anyway, and what is your time worth? How much time can you spend on this? After that, you have to decide whether these requirements allow this kind of solution or whether you need something else.
Second: In my world, the sentences "I need HA" and "I am going to buy crappy used servers for $200" can't possibly fit together (in fact, for me, buying used crap and professional use of any kind don't fit together at all).
Anyway, your questions:
If you write completely new data to the DRBD block device, it will be written correctly on the non-broken controller. It's a completely transparent layer in front of the actual disks, just as a software RAID or LVM. However, if you have data corruption on the primary node due to broken controllers or read errors from the disk, this could easily propagate to the secondary node since write operations are often read-modify-write cycles, and in this case, a block of corrupted data will be read on the primary node and a write operation for this block is sent to both nodes. This brings up the most important point when using DRBD: Same as a RAID, it is in no way a replacement for a good and reliable backup.
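The read-modify-write propagation described above can be sketched with a toy model. This is plain Python, not DRBD code: Node, replicated_write and update_byte are hypothetical names, and a "block" is just a short byte string.

```python
class Node:
    """One server's backing disk, modelled as a list of small blocks."""

    def __init__(self, blocks):
        self.blocks = [bytearray(b) for b in blocks]


def replicated_write(primary, secondary, block_no, data):
    """Mirror a whole-block write to both nodes, as a replicating layer would."""
    primary.blocks[block_no] = bytearray(data)
    secondary.blocks[block_no] = bytearray(data)


def update_byte(primary, secondary, block_no, offset, value):
    """Read-modify-write: read the block on the primary, patch one byte, write back."""
    block = bytearray(primary.blocks[block_no])  # the read is served by the primary only
    block[offset] = value
    replicated_write(primary, secondary, block_no, block)


if __name__ == "__main__":
    good = [b"AAAAAAAA", b"BBBBBBBB"]
    primary, secondary = Node(good), Node(good)

    # Simulate a faulty controller silently corrupting one byte on the primary.
    primary.blocks[0][3] = ord("X")
    print("secondary before update:", bytes(secondary.blocks[0]))

    # An application changes a different byte in the same block.
    update_byte(primary, secondary, 0, 7, ord("Z"))
    print("secondary after update: ", bytes(secondary.blocks[0]))

    # Completely new data overwrites the block cleanly on both nodes.
    replicated_write(primary, secondary, 0, b"CCCCCCCC")
    print("secondary after rewrite:", bytes(secondary.blocks[0]))
```

In this model the secondary stays clean until the application touches the damaged block; from then on both copies carry the corrupted byte, and only a backup taken before the fault would still hold the original data.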
I don't understand what you mean here.
If using VMs is useful in a single-node setup, it will be useful in the two-node setup as well, and you'll have the advantage of possible live migration when done right.
In my experience, yes. You should test it thoroughly in your environment though and spend a lot of time simulating the various fail states the system can experience and learn and document how to recover from them. While it's reliable, DRBD is not self-healing and requires a good understanding of the situation to recover from a failure condition.
You really want a dedicated connection between the nodes. In a two-node setup, this can be a point to point connection without a switch or something. Everything else might be possible technically but is just nonsense. Depending on your usage pattern, using trunking or faster NICs (e.g. 10G ethernet or Infiniband) for this dedicated link might be beneficial, but if most/all of the data to read or write comes from the LAN interface, this won't help as you are limited by the LAN anyway.
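As a rough way to reason about where the bottleneck sits, the sketch below treats sustained write throughput as the minimum of the client-facing LAN, the replication link, and the slower disk. It assumes fully synchronous replication (a write is acknowledged only once both nodes have it) and ignores latency and protocol overhead; the function name and every figure are placeholders, not measurements.

```python
def sustained_write_mb_s(lan_mb_s, replication_link_mb_s, disk_mb_s):
    """Upper bound on write throughput: the slowest stage in the path wins."""
    return min(lan_mb_s, replication_link_mb_s, disk_mb_s)


if __name__ == "__main__":
    GIGABIT = 110   # assumed usable MB/s on a 1G link
    TEN_GIG = 1100  # assumed usable MB/s on a 10G link
    DISK = 150      # assumed sequential write speed of the slower disk

    # Clients on 1G LAN, 1G dedicated replication link.
    print(sustained_write_mb_s(GIGABIT, GIGABIT, DISK))
    # Clients still on 1G LAN, replication link upgraded to 10G.
    print(sustained_write_mb_s(GIGABIT, TEN_GIG, DISK))
    # Data produced locally (e.g. by a VM on the node), 10G replication link.
    print(sustained_write_mb_s(TEN_GIG, TEN_GIG, DISK))
```

With these placeholder numbers the first two cases come out the same, which is the point made above: a faster replication link only pays off when the data does not have to arrive over the slower LAN first.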
This comes back to my first paragraph: What do you expect from it and what do you consider HA? For an experienced system administrator, it can be a cheap and reliable way to protect against a range of failures, but it requires a lot of fundamental understanding of how the parts fit together. Many small shops without such an experienced full-time SA are better off with quality hardware and a good support contract though.
Finally: Don't try to retrofit any HA solution onto your current hardware. As I wrote, you need time to experiment with the setup and its failure conditions. This requires a lot of downtime and can't reasonably be done on your production hardware.