I'm working on creating a DR setup and runbook based on AWS.
I don't have any experience with creating DR setups so it would be really helpful if the experienced veterans can guide me through it.
Our Setup:
RDS Aurora MySQL DB
ElastiCache
Ubuntu 16.04 Linux EC2 instances
Static files stored in S3
Route 53: 250 record sets in total
Application Load Balancer
Everything is under the same VPC. We're trying to build a Pilot Light DR setup.
It depends on what you're trying to achieve and what kind of Disaster (that's the D in DR) you're trying to protect against. The most likely D is an instance failure (a failed EC2 instance, ElastiCache node, RDS node, etc). Every other kind of disaster is quite rare.
Therefore in most cases it's enough to simply make your setup Multi-AZ with proper automatic fail-over and you're done. More specifically:
What's left are the EC2 instances. You should have them in Auto Scaling groups (ASG) spanning multiple Availability Zones, which means that if one instance fails it is automatically recreated elsewhere. Needless to say, this requires stateless instances, i.e. all your data should reside in the database or on a shared filesystem like EFS and not on the EC2 instances. Only then can you effectively put them in an ASG.
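As a rough sketch of the ASG part, the snippet below builds the request for boto3's `create_auto_scaling_group` call. The group name, launch template name, subnet IDs and target group ARN are all made-up placeholders, and it assumes you already have a launch template and one subnet per Availability Zone.

```python
def build_asg_request(name, launch_template_name, subnet_ids,
                      target_group_arns, min_size=2, max_size=4):
    """Build kwargs for autoscaling.create_auto_scaling_group (boto3)."""
    # The subnets are what spread the group across AZs. This check only
    # counts them; it cannot verify that they sit in different AZs.
    if len(subnet_ids) < 2:
        raise ValueError("use at least two subnets, each in a different AZ")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # Comma-separated list of subnet IDs.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Register instances with the Application Load Balancer target group.
        "TargetGroupARNs": list(target_group_arns),
        # Replace instances that fail the load balancer health check too,
        # not only the ones that fail EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# All identifiers below are hypothetical placeholders.
request = build_asg_request(
    name="web-asg",
    launch_template_name="web-launch-template",
    subnet_ids=["subnet-aaa111", "subnet-bbb222"],
    target_group_arns=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/web/0123456789abcdef"
    ],
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```

Keeping MinSize at 2 or more means there is still an instance serving traffic while a failed one is being replaced.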
If that's too hard you can set up a CloudWatch alarm to automatically recover a failed instance - it usually works pretty well too.
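A minimal sketch of such an alarm, as kwargs for boto3's `put_metric_alarm`. The instance ID, region, alarm name and the period/threshold values are assumptions for illustration. Note that the recover action reacts to system status check failures (problems with the underlying host), and it is only available for certain instance configurations, so check that yours qualifies.

```python
def build_recovery_alarm(instance_id, region):
    """Build kwargs for cloudwatch.put_metric_alarm (boto3)."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # System status check: fails when the underlying host has a problem.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per evaluation period
        "EvaluationPeriods": 2,  # two failed minutes in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action that recovers the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Hypothetical instance ID and region.
alarm = build_recovery_alarm("i-0123456789abcdef0", "us-east-1")

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
```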
Alternatively, convert your apps to Docker containers and run them on Fargate, which again provides auto-recovery in case of a container failure.
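For the Fargate route, the auto-recovery comes from running the containers as an ECS service: the service scheduler keeps the desired number of tasks running and starts a replacement when one stops. A sketch of the kwargs for boto3's `create_service` follows. The cluster, service, task definition, subnet and security group names are hypothetical, and it assumes the task definition is already registered.

```python
def build_fargate_service(cluster, service_name, task_definition,
                          subnet_ids, security_group_ids, desired_count=2):
    """Build kwargs for ecs.create_service (boto3) using Fargate."""
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        # ECS starts replacement tasks to keep this many running.
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                # Subnets in different AZs let tasks be spread across AZs.
                "subnets": list(subnet_ids),
                "securityGroups": list(security_group_ids),
                "assignPublicIp": "DISABLED",
            }
        },
    }


# All names and IDs below are hypothetical placeholders.
service = build_fargate_service(
    cluster="web-cluster",
    service_name="web",
    task_definition="web-task:1",
    subnet_ids=["subnet-aaa111", "subnet-bbb222"],
    security_group_ids=["sg-0123456789abcdef0"],
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("ecs").create_service(**service)
```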
The bottom line is - when a deployment is properly built in a cloud-native way there is almost no reason for traditional manual DR, since high availability and fault tolerance are inherently built into the deployment.
Hope that helps :)