We currently have an active-active two-node cluster running virtual machines. Every node has two disks, and every disk is DRBD-mirrored to the other node. Each node runs virtual machines off its primary DRBD device, and the Pacemaker cluster handles failover (if a node fails, the other becomes primary on both DRBD devices and runs all the VMs). This is colocated in a datacenter, so our costs (besides hardware acquisition) are driven by how many rack units we occupy.
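For reference, the current layout looks roughly like the following drbd.conf-style sketch (hostnames, devices, addresses and resource names are placeholders, not our real config):

    # Two DRBD resources; each is normally Primary on a different node,
    # so both nodes do useful work until one of them fails.
    # All names, devices and addresses below are made up.
    resource vmstore0 {
        protocol C;
        device    /dev/drbd0;
        disk      /dev/sda3;
        meta-disk internal;
        on node1 { address 10.0.0.1:7788; }
        on node2 { address 10.0.0.2:7788; }
    }
    resource vmstore1 {
        protocol C;
        device    /dev/drbd1;
        disk      /dev/sdb3;
        meta-disk internal;
        on node1 { address 10.0.0.1:7789; }
        on node2 { address 10.0.0.2:7789; }
    }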
When you start out small this is a great solution: it fits in 2U of rack space (assuming the Ethernet switch(es) are already there) and it's 100% redundant. But it's also a slightly difficult setup to manage, and it suffers when the I/O load gets too high (I guess that's simply down to the low number of spindles).
I'm wondering what the best solution would be to scale beyond our hardware capacity while staying cost-effective and remaining as redundant as is reasonable:
- carry on adding two-node clusters with internal storage, perhaps with bigger hardware (e.g. 2U servers with more disks)
- still use two-node clusters, but with external direct-attached storage (1U or 2U disk enclosures with SAS links) - see the note below
- separate storage and VMs (e.g. a pair of storage nodes, mirrored with DRBD, that export iSCSI and handle failover by moving the iSCSI target IP, coupled with two or more diskless nodes that run VMs off the static iSCSI target IP) - this seems to be what others are doing? See the sketch after this list.
- use something other than standard servers for the storage part (dedicated storage appliances on gigabit Ethernet?)
- anything else altogether?
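To make the third option more concrete, this is the kind of Pacemaker configuration I have in mind for the storage pair, in crm shell syntax (just a sketch; the resource names, IQN, IP address and device path are invented and untested):

    # A DRBD-backed iSCSI target that fails over together with its service IP.
    # Every name, IQN, address and device below is a placeholder.
    primitive p_drbd_storage0 ocf:linbit:drbd \
        params drbd_resource="storage0" \
        op monitor interval="30s" role="Slave" \
        op monitor interval="20s" role="Master"
    ms ms_drbd_storage0 p_drbd_storage0 \
        meta master-max="1" clone-max="2" notify="true"
    primitive p_target ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2010-01.local.example:storage0"
    primitive p_lun1 ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2010-01.local.example:storage0" lun="1" path="/dev/drbd0"
    primitive p_ip_iscsi ocf:heartbeat:IPaddr2 \
        params ip="192.168.10.50" cidr_netmask="24"
    group g_iscsi p_target p_lun1 p_ip_iscsi
    colocation c_iscsi_on_drbd inf: g_iscsi ms_drbd_storage0:Master
    order o_drbd_before_iscsi inf: ms_drbd_storage0:promote g_iscsi:start

The diskless VM nodes would only ever see the floating IP, so a storage failover should look to them like a short I/O stall rather than a path change.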
Splitting storage and application servers seems like the most flexible and reasonable solution to me: we could easily add more storage nodes when needed while keeping the current application servers, or do it the other way around when we hit a capacity limit.
What do you think are good / bad choices? Do you already have experience with this kind of setup on a modest budget (I tend to rule out Fibre Channel or 10000-euro storage appliances)?
EDIT: To be clear, the idea is that by leveraging modern (and free) software we can get redundancy just by adding more commodity hardware. It's not going to be screamin' fast nor super high availability, but it will keep our VMs running even if a motherboard dies, for as long as it takes to get a spare to the DC and replace the part.
EDIT: I removed the USB mention because it really wasn't going anywhere (thanks for pointing that out in the replies). I don't know how I forgot about SAS enclosures. As an example from the Dell website, an MD1000 is 2U with SAS links: two enclosures attached to two storage nodes via SAS, and those nodes could handle the redundancy and export iSCSI.
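Roughly, the export side on each storage node would be something like this with the stock scsi-target-utils tgtadm tool (the IQN and backing device are placeholders; in the clustered setup the Pacemaker resource agents would do this rather than a script):

    # Create an iSCSI target, attach the mirrored DRBD device as LUN 1,
    # and allow initiators to connect. IQN and device are placeholders.
    tgtadm --lld iscsi --op new --mode target --tid 1 \
           --targetname iqn.2010-01.local.example:storage0
    tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
           --backing-store /dev/drbd0
    tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address ALL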
First, USB disks are never going to give you good performance.
You seem to want two things that typically don't go together: a fully redundant, highly available solution for little to no money. Redundancy is expensive, and the more of it you want, the more it costs. At the low end of redundant solutions you have the smaller EMC, NetApp and Dell EqualLogic storage arrays. They all support iSCSI and Fibre Channel, so you can connect to them however you want. They pretty much all start at 4-5U and go up from there. These give you a solid storage platform that you can then build your virtual cluster around.
Doing host-based storage replication from internal disks on one box to internal disks on another box just won't scale for very long. Eventually the network will run out of bandwidth, or the disks won't be able to keep up with the load you put on them. The replication basically doubles the write load: every write has to hit the local disks and also be shipped across the network to be written on the other host, on top of all the heartbeat traffic. A single gigabit replication link, for instance, caps your write throughput at roughly 100-120 MB/s no matter how many spindles sit behind it.
If the business needs a high-availability solution, then it should really plan on paying for it. Doing it on the cheap will only work for so long, and it will end up biting you eventually.
You definitely want to separate the disks from the VMs, because you want the VM nodes to access shared storage (rather than separate mirrored disks) so that failover is nearly seamless.
I would also move away from OS-level clustering in favor of VM-level clustering. In my experience the data stores tend to be more vulnerable than the hardware and the OS (provided the OS has been set up for stability), and OS-level problems affecting one node of a cluster tend to carry over to the other node (bad updates, network issues, etc.), which renders OS clustering ineffectual. The VM hosts should have local disks just to run the hypervisors, but the VM disks should live on the shared storage (and you will want that shared storage on at least hardware RAID5).
Putting the VMs into a shared resource cluster (a la VMware) is the way to go, because it allows very granular automatic load balancing. With this setup, adding new hardware becomes a matter of attaching the new VM server to the shared disk, putting the hypervisor on it, and joining it to the cluster.
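If you go the iSCSI route mentioned in the question, for instance, attaching a new diskless host to the shared storage is roughly two open-iscsi commands on the Linux side (the portal IP and IQN below are placeholders):

    # Discover targets behind the storage cluster's floating IP, then log in.
    iscsiadm -m discovery -t sendtargets -p 192.168.10.50
    iscsiadm -m node -T iqn.2010-01.local.example:storage0 -p 192.168.10.50 --login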
I don't have any recommendations on the type of shared storage, since people who know the world of shared storage and VMs well tend to have very good data, and I defer to their judgement.
The only way you'll get any significant improvement in I/O across datacenters is to invest in serious amounts of dedicated bandwidth between them. Clustered file systems rely heavily on low latency and high bandwidth to perform well; once latency creeps in, the I/O bottleneck gets exponentially worse (1-10 ms is fine, 10-30 ms is not great, 30 ms+ is pretty bad).
There are ways to mitigate some of this overhead by using other storage methods, such as S3-style object storage or a simple replicated file system.
The downside is that because they are replicated, if one side updates a file at nearly the same time as the other, or one side updates files too frequently, you end up with a disjointed replica, which can be a nightmare to sort out. These types of storage are great if you do infrequent writes and lots of reads.
Trying to implement something like Amazon's EBS or S3 on the cheap is unlikely at best. They have a much bigger budget and huge amounts of bandwidth between their datacenters to play with.