The goal is a data link between VPCs in different regions that keeps working even if an availability zone fails at either end.
A simple VPN tunnel with one VPN instance at each end won't work: if the AZ hosting the instance fails at either end, the tunnel goes down.
I was hoping Amazon would allow you to create VPN tunnels from one VGW to another, but that does not appear to be possible yet. Too bad, because it would be an easy way to have a fully redundant interconnect option between VPCs in different regions.
There's also Direct Connect, which may allow some redundancy (at least at the AWS level) if properly set up. But it takes time to build a DC link, and it's pretty expensive. It's not something you would do just to test an idea.
In theory, a pair of VPN instances at each end, with BGP and some kind of heartbeat, might work, but the setup would be fairly complex: the instances would have to monitor each other and push route changes into the VPC (a rough sketch of what I mean is below). Are there any HOWTOs on this topic?
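Something along these lines is what I have in mind -- only a sketch, assuming boto3, with placeholder IPs and instance/route-table IDs, not a tested HA setup:

```python
# Rough sketch of the heartbeat/failover idea (placeholder values throughout):
# the standby VPN instance pings its active peer and, if the peer stops
# answering, points the VPC route for the remote VPC's CIDR at itself.
import subprocess
import time

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

PEER_IP = "10.0.1.10"              # active VPN instance (placeholder)
ROUTE_TABLE_ID = "rtb-0example"    # VPC route table to take over (placeholder)
REMOTE_VPC_CIDR = "10.64.0.0/16"   # far-end VPC (placeholder)
MY_INSTANCE_ID = "i-0standby"      # this standby VPN instance (placeholder)


def peer_alive() -> bool:
    """Single ICMP probe; a real setup would probe several times, over the tunnel itself."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", PEER_IP],
        stdout=subprocess.DEVNULL,
    ) == 0


while True:
    if not peer_alive():
        # Take over: route traffic for the remote VPC through this instance.
        ec2.replace_route(
            RouteTableId=ROUTE_TABLE_ID,
            DestinationCidrBlock=REMOTE_VPC_CIDR,
            InstanceId=MY_INSTANCE_ID,
        )
    time.sleep(10)
```

The hard parts (avoiding split-brain, failing back, and doing the same dance on the far end) are exactly what I'd hope a HOWTO would cover.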
Other ideas?
You're correct -- it's a tricky proposition.
I don't know if my configuration is brilliant or ridiculous, but I decided to avoid the need for the network to reconfigure itself by providing each availability zone with its own hardware... here's what I use:
In each AZ, I have a VPN instance, so for two regions with 3 AZs each, that's 6 in total. They are t2-class machines, so very cost-effective, with nice performance. Each of them has a tunnel to every VPN instance in the remote region, so each of the 3 machines carries 3 tunnels, one per remote AZ -- 9 tunnels between the regions in all.
The VPC routing is static -- in each AZ, all traffic for the far-end IP blocks routes to the tunnel server in the local AZ, which encrypts and encapsulates it and sends it on its way. Each tunnel server, in turn, has static routes to the subnets in the remote AZs, each terminating at the VPN node on the far side that serves the destination availability zone. (The subnets in each AZ fit in a contiguous /21 netblock, so only one route per remote AZ is needed.)
On the far side, the reverse is true.
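For concreteness, here's a minimal sketch of the VPC half of that static routing, assuming boto3; the instance/route-table IDs and the /21 blocks are placeholders, not my actual values:

```python
# Minimal sketch of the static VPC routing for one AZ (placeholder IDs/CIDRs).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

LOCAL_VPN_NODE = "i-aaaa1111"        # VPN instance in this AZ (placeholder)
LOCAL_ROUTE_TABLE = "rtb-0local1a"   # route table for this AZ's subnets (placeholder)
REMOTE_AZ_BLOCKS = (                 # one contiguous /21 per remote AZ (placeholders)
    "10.64.0.0/21",
    "10.64.8.0/21",
    "10.64.16.0/21",
)

# The VPN instance forwards traffic it didn't originate, so source/dest
# checking has to be disabled on it.
ec2.modify_instance_attribute(
    InstanceId=LOCAL_VPN_NODE,
    SourceDestCheck={"Value": False},
)

# In this AZ's route table, send each remote-region /21 to the local VPN
# node; the node's own OS routing table then picks the tunnel that terminates
# on the VPN node serving the destination AZ.
for block in REMOTE_AZ_BLOCKS:
    ec2.create_route(
        RouteTableId=LOCAL_ROUTE_TABLE,
        DestinationCidrBlock=block,
        InstanceId=LOCAL_VPN_NODE,
    )
```

Each of the other AZs (and each AZ on the far side) repeats this with its own VPN node and route table.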
Loss of a VPN node therefore isolates only the machines in that node's availability zone from the zones in the remote region; the other availability zones in the same region as the failure don't depend on any failover mechanism and aren't impacted.
With this setup, even the loss of everything in an entire availability zone would have no impact on the VPN connectivity between the remaining, unaffected zones.
The VPN machines actually have two ways to reach each of their peers. The first is the "normal" route, over the Internet to the peer's elastic IP; the other is a backup route, where the VPN servers all connect hub-and-spoke to a concentrator outside the AWS network. The theory is that the tunnel might survive a partial loss of direct Internet routability: the traffic takes a different backbone to the alternate hub site and is relayed to the peer, so the endpoints can keep passing traffic over a slower, higher-latency backup path rather than losing tunnel connectivity outright. Since the primary and alternate paths terminate in the same instance on each end, and the backup "hub" does not track flows, brief periods of asymmetric tunnel routing during a flap don't cause any problems.
A failure of a VPN node does have to be addressed -- either by replacing the node or by changing the route tables -- in order to restore connectivity to the impacted zone.
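The route-table change is simple enough to script when it happens. One possible remediation -- again just a sketch with placeholder IDs, not the exact procedure I use -- is to re-point the impacted AZ at a surviving VPN node in a sibling AZ until the failed instance is rebuilt:

```python
# Sketch of one remediation: point the isolated AZ's routes at a surviving
# VPN node in another local AZ (all IDs/CIDRs are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

AFFECTED_ROUTE_TABLE = "rtb-0local1a"   # route table of the AZ that lost its VPN node
SURVIVING_VPN_NODE = "i-bbbb2222"       # healthy VPN instance in a sibling AZ
REMOTE_AZ_BLOCKS = ("10.64.0.0/21", "10.64.8.0/21", "10.64.16.0/21")

for block in REMOTE_AZ_BLOCKS:
    ec2.replace_route(
        RouteTableId=AFFECTED_ROUTE_TABLE,
        DestinationCidrBlock=block,
        InstanceId=SURVIVING_VPN_NODE,
    )
```

That works because every VPN node already has tunnels to all of the remote AZs; the impacted zone just pays for an extra cross-AZ hop until its own node is back.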