We currently run FreePBX on a single node. I'd like to have a cluster of asterisk instances for sharing the load, but mainly for failover. I'm curious how other people have solved this problem. Ideally I'd like to use "free" (as in beer) software.
Thanks for any help!
EDIT
More info requested... to be honest though, we'll try anything if it's a good solution :)
We don't really care if calls get interrupted during a failover. This would be ideal, but I can imagine making this happen would be very complex due to the amount of state that has to be replicated between nodes.
I don't think we use SIP re-invite. I've never actually come across that term, despite running FreePBX for over a year. Our use case is just allowing people to place outbound calls with a softphone and routing a few inbound calls depending on DID.
Right now we run FreePBX in an LXC container. This works out because our upstream carriers are also SIP (not ISDN). We have experience with KVM, but we'd ideally like to run things in an LXC container for efficiency reasons.
Start by having a look at this voip-info web page: high availability design. It will explain what high availability is and isn't in the context of Asterisk. (It's easy to confuse high availability with load balancing.)
Next, look at this voip-info web page: high availability products. It will explain what is out there for open source / commercial solutions for high availability Asterisk.
No solution claims to keep calls up during failover while staying within the standard SIP protocol and without introducing a new single point of failure. SIP re-invites are used to allow RTP media streams to flow directly between endpoints, but Asterisk normally stays in the SIP stream, so this is not really an HA issue.
Something else to think about is how you define a 'failure'. In the most simplistic terms it's the Asterisk process dying. But often the Asterisk process is alive and just not bridging calls (so avoid simplistic process-monitoring scripts). What if the network connection goes out at the local data center, or a firewall fails? Your HA solution should be able to consider environmental factors such as upstream routes to determine whether a peer can no longer offer telephony service. Some solutions use generic Linux heartbeat software, which has no deep Asterisk visibility and no environmental visibility.
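To make the point about failure detection concrete, here is a minimal Python sketch of a monitor that goes beyond "is the process running". It is an illustration only, not how any of the products mentioned here work. The names `sip_options_probe` and `HealthMonitor`, the threshold of consecutive failures, and the idea of plugging in extra probes (for example, an upstream gateway check) are all assumptions of mine. Note that a reply to SIP OPTIONS only shows that the SIP stack answers; it still does not prove that calls are being bridged.

```python
import socket
import uuid
from typing import Callable, Dict


def sip_options_probe(host: str, port: int = 5060, timeout: float = 2.0) -> bool:
    """Send one SIP OPTIONS request over UDP; True if any SIP reply comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            local_ip, local_port = sock.getsockname()
            request = (
                f"OPTIONS sip:{host}:{port} SIP/2.0\r\n"
                f"Via: SIP/2.0/UDP {local_ip}:{local_port};"
                f"branch=z9hG4bK{uuid.uuid4().hex[:16]}\r\n"
                "Max-Forwards: 70\r\n"
                f"From: <sip:monitor@{local_ip}>;tag={uuid.uuid4().hex[:8]}\r\n"
                f"To: <sip:{host}:{port}>\r\n"
                f"Call-ID: {uuid.uuid4().hex}\r\n"
                "CSeq: 1 OPTIONS\r\n"
                "Content-Length: 0\r\n\r\n"
            )
            sock.send(request.encode("ascii"))
            reply = sock.recv(4096)
        except OSError:
            # Timeout, ICMP port unreachable, DNS failure, etc.
            return False
    return reply.startswith(b"SIP/2.0")


class HealthMonitor:
    """Declare a peer unhealthy once any probe fails `threshold` times in a row."""

    def __init__(self, probes: Dict[str, Callable[[], bool]], threshold: int = 3):
        self.probes = probes
        self.threshold = threshold
        self.failures = {name: 0 for name in probes}

    def poll(self) -> bool:
        """Run every probe once; return True while the peer counts as healthy."""
        for name, probe in self.probes.items():
            try:
                ok = probe()
            except Exception:
                ok = False  # a crashing probe is treated as a failed probe
            self.failures[name] = 0 if ok else self.failures[name] + 1
        return all(count < self.threshold for count in self.failures.values())
```

A monitor could be built as `HealthMonitor({"sip": lambda: sip_options_probe("192.0.2.10")})` (the address is a placeholder) and polled every few seconds. Requiring several consecutive failures avoids failing over on a single lost UDP packet.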
What about keeping data in sync between peers? This covers everything from voicemails to configuration data to phone set firmware. Solutions like DRBD make it easy, but corruption by one peer immediately corrupts the other. For example, if a faulty process on one peer damages critical Asterisk files, will the other peer start? If they use DRBD, then no. So avoid DRBD-based 'solutions'.
If you introduce load balancing (i.e. multiple active peers), which one 'wins' in the event that 2 peers each receive voicemail #1 for user 123 at the same time? This requires you to introduce front-end servers for call bridging, back-end servers for voicemail, etc. And you still have single points of failure or shared components.
If you recover from a failure and the cluster needs to re-assemble, what happens if each peer wrote data to its copy of the shared 'disk'? Do you manually start reconciling? What if 2 peers come up at once (dual active) - which one wins and takes over? If you introduce a shared disk solution (DRBD, NFS, iSCSI) then you eliminate one of the biggest and most important elements of an HA solution: peer autonomy. So look for 'synchronization', not 'shared disk'.
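As one illustration of the "which one wins" question, here is a small Python sketch of a deterministic tie-break. It assumes each peer has a configured priority and a health flag and that both peers can see the same list. `Peer`, `priority`, and `elect_active` are hypothetical names, not part of Asterisk or any product mentioned here. It does not address a network partition, where each peer believes the other is dead.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Peer:
    name: str
    priority: int  # hypothetical config value; higher wins
    healthy: bool


def elect_active(peers: Iterable[Peer]) -> Optional[Peer]:
    """Pick the healthy peer with the highest priority; ties go to the lowest name.

    The result does not depend on input order, so every peer that evaluates
    the same list reaches the same answer.
    """
    candidates = [p for p in peers if p.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (-p.priority, p.name))
```

With two healthy peers of equal priority named `pbx-a` and `pbx-b`, both sides would pick `pbx-a`, so only one of them takes over.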
The cheapest 'HA' solutions for Asterisk tend to use a shared virtual disk (e.g. DRBD/NFS/SMB) and/or a shared channel bank (e.g. Astribank). As described above, real HA solutions (like the ones used in 911/PSAP call centers) require completely autonomous peers and call paths.
There was a (commercial) FreePBX module which uses a shared disk (so if one peer goes down and corrupts the disk, then the other peer is corrupted as well) and basic detection. It was cheap and easy to install for a home user; however, FreePBX discontinued support for the product years ago, and it only works with very old FreePBX versions. Elastix offered a similar module for free, but Elastix is now Issabel (new product name), so I think the how-to guide is gone. (If you are Linux savvy, you can create the same 'module' yourself with the Linux DRBD and Heartbeat packages, which are available at no charge.)
On the high end is HAAst, a (free / commercial) product which has no shared components, uses sophisticated health detection, and is compatible with all Asterisk distributions. It requires more Linux skills to install and can be more expensive depending on the edition (more for enterprise phone systems). Off to the side is flipit, a free script that is simple to install, but it's a stretch to call it 'HA'.
There is also VMware, which offers generic HA (but it's not PBX/trunk/SIP/etc aware), and you will find some vendors offering RAID 1 as "HA" for a PBX, which is also a stretch. There are more products in this spectrum too. No vendor 'approves of', 'endorses' or 'certifies' any other product, so you have to try before you buy.
You will also find people offering 'containers' as HA solutions, but that isn't really HA. Containers are a convenient way to deploy software, and you can have a spare PBX container ready to deploy, but you get no synchronization of settings/voicemails/etc and no detection of failures.
Just be sure to ask the right questions when you do evaluate products! No single product is right for everyone - but the voip-info HA design page will help you select one based on the right trade-offs. If you need to meet 911/PSAP standards or are building for a high volume call center, have a look at the high-end HAAst product. If it's for home use, try flipit or the free edition of one of the commercial products first.