Re-entrant vs idempotent in Ansible roles

I wasted a couple of hours tracking down a problem with a raft of new AWS EC2 instances provisioned with Ansible, and it’s worth explaining because it showcases a problem common to a lot of Ansible roles. While the Ansible docs talk up the concept of “idempotency” (the ability to run a playbook multiple times without screwing up your hosts), not much is said about being “re-entrant.” I’ll explain.

Here’s a typical task from an Ansible role:

- name: Create remote_syslog init.d service
  template:
    src: remote_syslog.init.d.j2
    dest: /etc/init.d/remote_syslog
    owner: root
    group: root
    mode: 0775
  notify:
    - enable remote_syslog.init.d
    - restart remote_syslog.init.d
  when: ansible_service_mgr != 'systemd'

The notify: items name handlers like these, to be executed at a later time:

- name: enable remote_syslog.init.d
  service:
    name: remote_syslog
    enabled: yes

- name: restart remote_syslog.init.d
  service:
    name: remote_syslog
    state: restarted

There are two reasons you’d notify handlers like this instead of just running them inline as tasks:

1. Several tasks may all notify the same handler. Handler execution is deferred (normally until the end of the play), so the handler runs only once no matter how many “notify:” events it receives.
2. The handler only gets called when the notifying task is “changed.” What that means depends on the task, but in general it means something was actually, well, changed. If you run this role again, most or all of the tasks won’t need to do anything, so they won’t be “changed” and won’t trigger the handlers, either.
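To illustrate the first reason, here is a minimal sketch of a second task that notifies the same restart handler. The task itself is hypothetical: the template name and destination path are made up for illustration and are not from the role discussed here.

```yaml
# Hypothetical second task; src and dest are assumptions for illustration.
- name: Configure remote_syslog
  template:
    src: log_files.yml.j2
    dest: /etc/log_files.yml
  notify:
    # Same handler name as in the init.d task above.
    - restart remote_syslog.init.d
```

If both this task and the init.d task report “changed” in the same run, the restart handler is still queued only once and the service restarts a single time.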

That’s all fine and dandy… until a playbook is interrupted. All the talk of “idempotency” implies you can just re-run an interrupted Ansible run and everything will be fine, but not so! Using the examples above:

1. The Ansible playbook starts up and runs the “Create remote_syslog init.d service” task to create the init.d service. The task reports “changed,” so it notifies the handlers to enable and restart the service later in the run.
2. The run is interrupted before the handlers execute, leaving the host setup incomplete.
3. So you run the playbook again to finish the job. This time, the “Create remote_syslog init.d service” task has nothing to do because the init.d script already exists. However, because its status is not “changed,” the handlers are not notified.
4. The Ansible run completes and everything looks fine, but the service has been neither enabled nor restarted. Here “enabled” means the service starts at boot, so the service is not running now and won’t start after a reboot either.
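One way to shrink the window in which this can happen (a sketch, not something the role above does) is to flush the handler queue right after the notifying task, using Ansible’s meta: flush_handlers action. This narrows the gap but does not close it: an interruption between the task and the flush still loses the notification.

```yaml
- name: Create remote_syslog init.d service
  template:
    src: remote_syslog.init.d.j2
    dest: /etc/init.d/remote_syslog
    owner: root
    group: root
    mode: 0775
  notify:
    - enable remote_syslog.init.d
    - restart remote_syslog.init.d
  when: ansible_service_mgr != 'systemd'

# Run any handlers notified so far now, instead of waiting for the end of the play.
- meta: flush_handlers
```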

The problem is that “idempotent” is not “re-entrant.” Re-entrant means the run can be interrupted and re-run, and everything will still end up ok. Here’s a great post from Pivotal describing the difference. I expected the Ansible role to be re-entrant. It only cost me a little time, but it could easily have resulted in security holes or worse.

The lesson is this:

Any Ansible playbook or role that uses notify to trigger handlers (which is, like, nearly all of them) cannot be fully re-entrant. Notifications occur only when the notifying task has a changed state, so if a run is interrupted after the task completes but before the handlers run, the handler queue is lost. Re-running the task typically won’t result in a changed state, so the handlers will never run.

That’s fine, so long as you are expecting this behavior. I have a few words to say about that here!
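For the specific failure described above, one possible mitigation is to declare the desired end state as an ordinary task rather than leaving it entirely to handlers. The sketch below is an assumption about how that could look, not part of the original role: it reuses the service name and the when: condition from the example, and relies on the service module’s enabled and state: started options.

```yaml
# Sketch: runs on every play, so it converges even after an interrupted run.
- name: Ensure remote_syslog is enabled and running
  service:
    name: remote_syslog
    enabled: yes
    state: started
  when: ansible_service_mgr != 'systemd'
```

Because this task checks the actual state of the service each time, a re-run after an interruption enables and starts it if that never happened. The restart handler is still useful for picking up later changes to the init script; this task only guarantees that the service is enabled and running.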