I am currently trying to figure out a good configuration to make a Bastion host highly available. I want to meet the following targets:
- The bastion host(s) need to be able to withstand an Availability Zone failure and an EC2 instance failure. A short downtime (a few minutes) may be acceptable.
- The bastion host(s) need to be reachable via a permanent DNS entry.
- No manual intervention needed
My current setup is as follows: Bastion host in Auto Scaling Group in two availability zones, ELB in front of the Auto Scaling Group.
This setup has a few advantages:
- Easy to set up using CloudFormation
- Auto Scaling Groups over two AZs can be used to guarantee availability
- The ELB does not count towards the account's EIP limit
It also has some disadvantages:
- With two or more bastion hosts behind the ELB, SSH host key warnings are common, and I do not want our users to get accustomed to ignoring SSH warnings.
- The ELB costs money, unlike an EIP; about as much as the bastion host itself, actually. This is not much of a concern; I added this point only for the sake of completeness.
The obvious other solution is to use an ElasticIP, which has - as I see it - a few drawbacks:
- I can (obviously) not attach an EIP to an Auto Scaling Group directly
- When not using Auto Scaling Groups, I have to put something in place that starts new EC2 bastion hosts if the old ones fail, e.g. using AWS Lambda. This adds additional complexity.
- When the EIP is attached manually to an instance in an Auto Scaling Group, an Availability Zone failure detaches the EIP, and it is not reattached to the replacement instance. This can be resolved by running a program (on the instance or in AWS Lambda) that reattaches the EIP, but that again adds complexity.
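For reference, the "program that reattaches the EIP" could be quite small. The following is only a rough sketch, assuming boto3 is available (as it is in the AWS Lambda Python runtime); the function names and the environment variables `ASG_NAME` and `EIP_ALLOCATION_ID` are made up for illustration, and how the function gets triggered (a schedule, an Auto Scaling notification, etc.) is left open.

```python
import os


def reattach_eip(ec2, autoscaling, asg_name, allocation_id):
    """Associate the EIP with a healthy, in-service instance of the group.

    `ec2` and `autoscaling` are boto3 clients (or objects with the same
    methods). Returns the chosen instance ID, or None if no instance is ready.
    """
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups:
        return None

    candidates = [
        inst["InstanceId"]
        for inst in groups[0]["Instances"]
        if inst["LifecycleState"] == "InService"
        and inst["HealthStatus"] == "Healthy"
    ]
    if not candidates:
        return None

    # Pick deterministically so repeated runs keep choosing the same instance.
    instance_id = sorted(candidates)[0]
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True,  # take the EIP over if it is still attached elsewhere
    )
    return instance_id


def handler(event, context):
    """Hypothetical Lambda entry point; the environment variable names are assumptions."""
    import boto3  # imported here so reattach_eip can be exercised without AWS

    return reattach_eip(
        boto3.client("ec2"),
        boto3.client("autoscaling"),
        os.environ["ASG_NAME"],
        os.environ["EIP_ALLOCATION_ID"],
    )
```

Even in this form it needs an IAM role, a trigger and some monitoring of its own, which is the additional complexity I would like to avoid.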
What are best practices for High availability SSH instances, i.e. bastion hosts?
It looks like the requirement is to provide bastion functionality at lowest reasonable cost with an RTO of say 5 minutes. No RPO is applicable as it's effectively a stateless proxy that can be rebuilt easily.
I'd have a single bastion host, defined either as an AMI or a CloudFormation script (an AMI is faster), inside an Auto Scaling Group with min/max/desired capacity set to 1. I wouldn't use a load balancer, as I can't see any need for one. The instance would be registered with Route53 under a public domain name, so you can still reach it even if the instance changes; that should eliminate the SSH warnings, provided the replacement instance presents the same host keys. I might start with one instance in each subnet, but I'd probably turn one off if they prove reliable enough, which they should.
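As a rough illustration of the Route53 registration step, something like the following could run on the instance at boot (for example from user data). It is a sketch rather than a finished script: it assumes boto3 is installed and the instance role is allowed to change the record, the function names, hosted zone ID and record name are placeholders, and looking up the instance's public IP (e.g. from instance metadata) is left out.

```python
def build_upsert(record_name, public_ip, ttl=60):
    """Build a Route53 change batch that points an A record at the given IP."""
    return {
        "Comment": "Point bastion DNS name at the current instance",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,  # short TTL so clients pick up a replacement quickly
                    "ResourceRecords": [{"Value": public_ip}],
                },
            }
        ],
    }


def register_bastion(route53, hosted_zone_id, record_name, public_ip):
    """Apply the change batch using a boto3 Route53 client (or a compatible object)."""
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_upsert(record_name, public_ip),
    )


if __name__ == "__main__":
    import boto3  # only needed when actually talking to AWS

    # Placeholder values; substitute your own zone ID, name and IP lookup.
    register_bastion(
        boto3.client("route53"),
        "ZEXAMPLE123",
        "bastion.example.com.",
        "203.0.113.10",
    )
```

The short TTL is a trade-off: it keeps the time until clients see a replacement instance within the few-minute recovery target, at the cost of more DNS queries.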
A CloudFormation deployment of bastion hosts is described by Amazon here. Amazon have a best practice guide here. You shouldn't address internal resources using their Elastic IP as they're public IPs and traffic to them is charged, whereas private IP traffic isn't charged. Domain names are cheaper. This might involve some CloudFormation script tweaking.