A problem I keep running into with ansible is that one deployment step should run when any of a number of preparation steps reports changed, but that changed status is lost when a fatal error interrupts the run.
When ansible cannot continue after one successful preparation step, I still want the machine to eventually reach the state the playbook was meant to achieve. But ansible forgets, e.g.:
- name: "(a) some task is changed"
git:
update: yes
...
notify:
# (b) ansible knows about having to call handler later!
- apply
- name: "(c) connection lost here"
command: ...
notify:
- apply
- name: apply
# (d) handler never runs: on the next invocation git-fetch is a no-op
command: /bin/never
Since the preparation step (a) is now a no-op, re-running the playbook does not recover this information.
For some tasks, simply running ALL handlers is good enough. For others, one can rewrite the handlers into ordinary tasks that decide for themselves, via when: conditions, whether they need to run (a minimal sketch follows). But some tasks and checks are expensive and/or unreliable, so this is not always good enough.
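As a sketch of that rewrite, assuming a hypothetical deploy script and hypothetical version files to compare; the actual check depends on what "apply" really does:

  - name: check whether the deployed version still matches the checkout
    command: diff -q /srv/checkout/VERSION /srv/deployed/VERSION  # hypothetical paths
    register: version_check
    changed_when: false
    failed_when: version_check.rc not in [0, 1]

  - name: apply (now an ordinary task that decides for itself whether to run)
    command: /usr/local/bin/deploy  # hypothetical apply command
    when: version_check.rc == 1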
Partial solutions:
- Write out a file and check for its existence later instead of relying on the ansible handler (sketched after this list). This feels like an antipattern; after all, ansible knows what's left to do - I just do not know how to get it to remember that across multiple attempts.
- Stay in a loop until it works or a manual fix is applied, however long that may take: this seems like a bad trade, because then I might not be able to use ansible against the same group of targets, or I have to safeguard against undesirable side effects of multiple concurrent runs.
- Just require higher reliability of the targets, so that failures are rare enough to justify always resolving these situations manually, using --start-at-task= and checking which handlers are still needed: experience says things do occasionally break, and right now I am adding more things that can.
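For reference, a minimal sketch of the flag-file workaround from the first bullet above; the marker path and the deploy command are made-up placeholders:

tasks:
  - name: "(a) some task is changed"
    git:
      repo: ...
      dest: ...
      update: yes
    register: checkout

  - name: remember that an apply is still pending (survives a failed run)
    file:
      path: /var/run/apply-pending  # hypothetical marker file
      state: touch
    when: checkout is changed

  - name: "(c) connection lost here"
    command: ...

  - name: apply whenever the marker exists, even on a later run
    command: /usr/local/bin/deploy  # hypothetical apply command
    args:
      removes: /var/run/apply-pending  # skipped unless the marker is present

  - name: clear the marker once apply has succeeded
    file:
      path: /var/run/apply-pending
      state: absent

This works, but it duplicates state that ansible already tracks internally, which is why it feels like an antipattern.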
Is there a pattern, feature or trick to properly handle such errors?