We are in the process of moving away from AWS, where we have a highly available setup built on EC2's Auto Scaling feature. We aren't using it to resize the pool based on resource usage, though; we only use it to spin up a new instance when one fails or becomes unresponsive.
Other cloud providers don't offer this auto scaling feature (we are looking specifically at DigitalOcean, but the question should apply anywhere), so what are some options for achieving the same setup? My first thought was to create an instance that monitors the others, but then that server becomes a single point of failure. Are there any services or established patterns that accomplish this, whether fully automated or through our own scripts against the provider's API, without creating a single point of failure?
We ended up writing our own solution to roughly mimic the EC2 behavior. We called it healthcare.js and open-sourced it at https://github.com/goldfire/healthcare.js. Essentially, it uses the DigitalOcean API and tags for discovery, and then uses democracy.js to monitor which servers are running. This gives us a fully distributed, self-healing system that kills and rebuilds servers based on the server configs passed to it.
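To illustrate the general idea of avoiding a dedicated monitor, here is a minimal sketch in Python. It is not how healthcare.js or democracy.js are implemented; the `Peer` type, the function names, and the "lowest healthy name acts" rule are all assumptions made for illustration. Every node runs the same logic over the same peer list, and a deterministic rule decides which single healthy node performs the rebuild, so no one server is special.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str      # hypothetical server name, e.g. discovered via a shared tag
    healthy: bool  # result of this node's most recent health check


def elect_repairer(peers):
    """Return the name of the healthy peer that should act, or None.

    Assumed rule: the alphabetically first healthy peer is responsible.
    """
    healthy = sorted(p.name for p in peers if p.healthy)
    return healthy[0] if healthy else None


def plan_repairs(self_name, peers):
    """Return the names of peers this node should rebuild.

    The list is empty unless this node is the elected repairer.
    """
    if elect_repairer(peers) != self_name:
        return []
    return sorted(p.name for p in peers if not p.healthy)


# Example pool: web-1 has failed, web-2 and web-3 are still running.
pool = [Peer("web-1", False), Peer("web-2", True), Peer("web-3", True)]
web2_plan = plan_repairs("web-2", pool)  # web-2 is elected, so it rebuilds web-1
web3_plan = plan_repairs("web-3", pool)  # web-3 is not elected, so it does nothing
```

This sketch assumes every node sees the same health results. During a network partition two nodes could each believe they are responsible, so a real deployment would likely want some form of quorum or voting before destroying a server.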