A problem I keep running into with Ansible is that one deployment step should run whenever any of a number of preparation steps has changed, but the changed status is lost due to fatal errors.
When Ansible cannot continue after one successful preparation step, I still want the machine to eventually reach the state the playbook was meant to achieve. But Ansible forgets, e.g.:
- name: "(a) some task is changed"
git:
update: yes
...
notify:
# (b) ansible knows about having to call handler later!
- apply
- name: "(c) connection lost here"
command: ...
notify:
- apply
- name: apply
# (d) handler never runs: on the next invocation git-fetch is a no-op
command: /bin/never
Since preparation step (a) is now a no-op, running the playbook again does not recover this information.
For some tasks, just running ALL handlers is good enough. For others, one can rewrite the handlers into tasks that know when to run. But some tasks and checks are expensive and/or unreliable, so this is not always good enough.
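For instance, a sketch of such a rewrite, in which the apply task determines for itself whether work is pending instead of relying on a notification (the paths and the apply command are hypothetical):

- name: read the currently deployed revision, empty if none
  command: cat /srv/app/REVISION   # hypothetical version marker
  register: deployed
  changed_when: false
  failed_when: false

- name: read the revision of the fresh checkout
  command: git -C /srv/checkout rev-parse HEAD   # hypothetical path
  register: checked_out
  changed_when: false

- name: apply whenever the two differ, regardless of how earlier runs ended
  command: /usr/local/bin/apply   # hypothetical; expected to update REVISION
  when: deployed.stdout != checked_out.stdout

The cost is an extra check on every run, which is exactly what makes this unattractive when the check is expensive or unreliable.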
Partial solutions:
- Write out a file and check for its existence later, instead of relying on the Ansible handler (a sketch follows this list). This feels like an antipattern: after all, Ansible knows what's left to do; I just do not know how to get it to remember that across multiple attempts.
- Stay in a loop until it works or a manual fix is applied, however long that may take. This seems like a bad trade, because now I might not be able to use Ansible against the same group of targets, or I have to safeguard against undesirable side effects of multiple concurrent runs.
- Require higher reliability of the targets, so that failures are rare enough to justify always resolving these situations manually, using --start-at-task= and checking which handlers are still needed. Experience says things do occasionally break, and right now I am adding more things that can.
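As an illustration of the first option, a minimal sketch of the marker-file pattern (the marker path and the apply command are hypothetical):

- name: "(a) some task is changed"
  git:
    repo: https://example.com/repo.git   # hypothetical
    dest: /srv/checkout
  register: checkout

- name: leave a marker so the pending apply survives a crashed run
  file:
    path: /var/tmp/needs_apply   # hypothetical marker file
    state: touch
  when: checkout is changed

- name: apply if this or a previous run left the marker
  command: /usr/local/bin/apply   # hypothetical
  args:
    removes: /var/tmp/needs_apply   # run only while the marker exists

- name: remove the marker only after a successful apply
  file:
    path: /var/tmp/needs_apply
    state: absent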
Is there a pattern, feature or trick to properly handle such errors?
The Ansible docs you linked to suggest a way to deal with this: force_handlers, which makes notified handlers run even when a later task fails. Placing it in ansible.cfg will ensure that it is the default behavior for every playbook and role you run.
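A minimal sketch of that setting in ansible.cfg; the same switch also exists per play (force_handlers: true) and on the command line (--force-handlers):

[defaults]
force_handlers = True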
Very little can save you if the host dies during a playbook run.
It seems that currently the only way to tackle this problem is the one Michael Hampton pointed out.
IMHO this is not a fully viable solution, since the handlers themselves can fail because of the same underlying error that crashed the playbook run. A better solution would persist handler notification state between playbook executions, ideally on the remote hosts. Ansible already has the concept of facts and custom facts, which keep some state on the remote host's disk.
Currently I have no working concept of how to implement that.
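For what it's worth, a rough, untested sketch of how the custom-facts idea might look (the repo, file names, and the apply command are all hypothetical):

- name: ensure the custom facts directory exists
  file:
    path: /etc/ansible/facts.d
    state: directory

- name: "(a) some task is changed"
  git:
    repo: https://example.com/repo.git   # hypothetical
    dest: /srv/checkout
  register: checkout

- name: persist the pending notification on the target's disk
  copy:
    dest: /etc/ansible/facts.d/pending.fact
    content: '{"apply": true}'
  when: checkout is changed

- name: re-read local facts so the flag is visible within this run too
  setup:
    filter: ansible_local

- name: apply when this run, or an earlier crashed one, requested it
  command: /usr/local/bin/apply   # hypothetical
  when: (ansible_local.pending | default({})).apply | default(false)

- name: clear the flag only after a successful apply
  file:
    path: /etc/ansible/facts.d/pending.fact
    state: absent
  when: (ansible_local.pending | default({})).apply | default(false)

The flag is written immediately after the change and removed only after a successful apply, so a crash anywhere in between leaves the notification on the host for the next run to pick up.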