I'm trying to choose a configuration management system for 500-2000 very-geographically-distributed hosts. Due to varying network reliability, it's possible that a number of hosts may be temporarily unavailable at any given time. For this reason, my initial choice was Chef, since it uses a "pull" model, and when hosts come online and check in, they'll immediately get current configuration.
However, if my hosts only poll the Chef server for new configuration every 30 minutes, rapid deployments are impossible. Also, I am not a Rubyist. I would prefer to use a push-based model, where I can push configuration to hosts as rapidly as possible. So, the natural choices seem to be Ansible or SaltStack (probably SaltStack). But my question is: How do Ansible and SaltStack handle failed or down hosts? Is there some way to keep retrying a push forever until a host comes back online? Are there existing patterns for properly handling eventual consistency of down-hosts with either of these tools? Thanks!
I can only answer this for Ansible.
Ansible itself does not handle hosts which are not reachable. It will try to connect once, and if that is not possible, the host is dropped from the current play. But Ansible gives you some tools to deal with this yourself.
First, there is the wait_for module. With it you can wait, with a very high timeout, until the hosts are available.
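A hedged sketch of such a task (the port and timeout are illustrative assumptions); delegated to the control machine, it blocks until the target's SSH port answers:

```yaml
- name: Wait until the host's SSH port is reachable
  wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    timeout: 3600          # wait up to an hour for the host to come back
  delegate_to: localhost   # run the check from the control machine
```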
On its own, though, this would be a problem when you run the play, because by default Ansible does not process any further tasks until all hosts have passed the current task, which is counterproductive in this case. Going by your description, the first hosts could be unavailable again by the time the last host finally becomes reachable.
To solve this you need Ansible 2, which has a new feature called strategies.
Setting `strategy: free` allows every host to run through the play as fast as it can, meaning each host executes its tasks as soon as it is available, without waiting for the others. Still, a connection could go down, and in that case there is no built-in way to retry automatically: if the SSH connection cannot be established, a fatal error is thrown for that host, and as of Ansible ~1.9 there is no way to catch this kind of connection error. That does not affect the other hosts, though; they will all play on fine.
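Putting the two together, a play could look roughly like this (a minimal sketch under assumptions; the host pattern and the final task are placeholders, not the author's playbook):

```yaml
- hosts: all
  strategy: free              # each host moves through the task list at its own pace
  tasks:
    - name: Wait until the host is reachable
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        timeout: 3600
      delegate_to: localhost

    - name: Apply the actual configuration
      ping:                   # stand-in for the real configuration tasks
```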
You can retry, though. Failed hosts are stored in a file named `<playbook-name>.retry` next to the playbook itself, and you can then re-run the playbook limited to just those hosts.
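For example, assuming the playbook is called `site.yml` (so the retry file is `site.retry`):

```sh
ansible-playbook site.yml --limit @site.retry
```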
Salt runs in a pull model from the nodes to the master, but you can issue global commands from the master.
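For instance, a highstate targeted at a glob of minion IDs (a hedged sketch; the exact invocation is an assumption based on the sentence that follows):

```sh
salt 'api*.domain.com' state.highstate
```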
That will run a highstate on all hosts whose id (hostname) matches `api*.domain.com`. A highstate is like a full Chef run.
Usually people will either have the master schedule highstate runs on the minions, or they will configure the schedule on the minions themselves, e.g. to run a highstate every 10 minutes.
So if a node is down and you run a command on the master to apply a state, Salt will report the node as down in its run output, which can be formatted in many different ways for you to ingest. It can even be logged to MySQL, for example.
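For instance (a hedged sketch using standard Salt CLI options; the targeting is an assumption):

```sh
# structured output that a script can ingest
salt --out=json 'api*.domain.com' state.highstate

# send job results to a configured returner, e.g. MySQL
salt --return mysql 'api*.domain.com' state.highstate
```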
So, for example, say you ran the above command on the master to run a highstate on all `api*.domain.com` nodes, and 2 of the 5000 were currently rebooting: once `salt-minion` came back online on those nodes, they would get the event from the master via the message bus and run the highstate.

Salt also has a thing called proxy nodes to help with the load on a master. You could have a single master somewhere and a proxy node in each datacenter; all the commands sent from the master go through the proxy nodes, and the minions in those datacenters talk to their proxy node and never to the master.
To extend Mike's answer, you can do push and pull simultaneously with Salt. Pushing is as easy as a single command from the master.
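For example (a hedged sketch; pushing the full highstate to every minion is an assumption, not the original snippet):

```sh
# push the full configured state to all minions right now
salt '*' state.highstate
```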
At the same time, your minions can do a scheduled pull every X minutes or hours via the built-in scheduler. My preferred method is to configure it via pillar, but adding it to the minion config works too.
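Something like this hedged sketch (the job name and the 30-minute interval are illustrative assumptions); the same block can live in pillar or in the minion config:

```yaml
schedule:
  periodic_highstate:          # arbitrary job name
    function: state.highstate  # pull and apply the full state
    minutes: 30                # run every 30 minutes
```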