Satellite / SAT-19752

Ansible handling of hung jobs

      Let me explain with an example (reproducer):

      • Have Satellite 6.13
      • Have 2 content hosts - jsenkyri-rhel9c and jsenkyri-rhel9d.
      • Create 2 roles - nfs_hang and run_check.
      • The nfs_hang role emulates a hung task: it tries to read a file hosted on a mounted NFS share while the NFS service is not running, so the process hangs indefinitely (see the setup sketch after these steps).
      • The run_check role will simply add a new line to /tmp/ansible_runs.txt with date & time of the ansible run.
      • Assign role 'nfs_hang' to the jsenkyri-rhel9c host:
      ~~~
      ---
      # tasks file for nfs_hang
      - name: Try to open a file hosted on nfs server. If the nfs-service is not running then this should hang forever.
        command: cat /nfs/imports/test/file.txt
        register: cat_output
        ignore_errors: true

      - name: Print the output
        debug:
          var: cat_output.stdout_lines
        when: cat_output is defined and cat_output.rc == 0
      ~~~

      • Assign role 'run_check' to the other host jsenkyri-rhel9d:
      ~~~
      ---
      # tasks file for run_check
      - name: Get current time and date
        set_fact:
          current_time: "{{ ansible_date_time.iso8601 }}"

      - name: Append time and date to /tmp/ansible_runs.txt
        lineinfile:
          path: /tmp/ansible_runs.txt
          line: "Current time and date: {{ current_time }}"
          create: yes
      ~~~

      • Select both hosts and do 'Run all Ansible roles'. Satellite will create 2 tasks - RunHostJob for 'jsenkyri-rhel9c' and RunHostJob for 'jsenkyri-rhel9d'.
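
      For completeness, a minimal sketch of how the NFS-hang precondition can be prepared. The NFS server hostname (nfsserver.example.com) and the export setup are assumptions; only the client-side path /nfs/imports/test/file.txt comes from the nfs_hang role above:
      ~~~
      # On the NFS server (hostname and export path are assumptions):
      echo "/nfs/imports/test *(ro)" >> /etc/exports
      exportfs -ra
      systemctl start nfs-server

      # On the client jsenkyri-rhel9c: a hard mount (the NFS default) makes reads
      # block instead of erroring out when the server stops responding.
      mkdir -p /nfs/imports/test
      mount -t nfs nfsserver.example.com:/nfs/imports/test /nfs/imports/test

      # Back on the NFS server: stop the service. Reading file.txt on the client
      # now hangs indefinitely, which is exactly what the nfs_hang role triggers.
      systemctl stop nfs-server
      ~~~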

      Result:

      • Both jobs will hang forever. The tasks are left in running/pending state until you cancel them. Nothing is added to '/tmp/ansible_runs.txt' on 'jsenkyri-rhel9d'.

      Expectation:

      • Job for 'jsenkyri-rhel9c' hangs. Job for 'jsenkyri-rhel9d' finishes successfully and a new line is added to '/tmp/ansible_runs.txt'.

      ###

      If you set batch_size=1, the result is as one would expect: 'jsenkyri-rhel9c' will hang forever, while 'jsenkyri-rhel9d' will finish successfully and write a new line to '/tmp/ansible_runs.txt'. However, this impacts performance (a sketch of the setting change is below).
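
      For reference, a sketch of how the batch size can be lowered. My assumption is that batch_size corresponds to the 'Proxy tasks batch size' setting (foreman_tasks_proxy_batch_size) under Administer > Settings > Tasks:
      ~~~
      # Assumption: lower the proxy batch size to 1 for testing...
      hammer settings set --name foreman_tasks_proxy_batch_size --value 1
      # ...and revert to the default of 100 afterwards:
      hammer settings set --name foreman_tasks_proxy_batch_size --value 100
      ~~~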

      I suppose batch_size=1 does the trick because then each job has its own ansible-runner:

      # ps -ef | grep ansible
      ~~~
      foreman+ 3019108       1  0 12:49 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
      foreman+ 3019185 3018422  6 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-18j5dcu -p playbook.yml
      foreman+ 3019186 3018422  7 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-jjtqb3 -p playbook.yml
      foreman+ 3019187 3019185 30 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
      foreman+ 3019189 3019186 32 12:51 pts/1    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-jjtqb3/inventory playbook.yml
      foreman+ 3019201 3019187 10 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
      foreman+ 3019209       1  0 12:51 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
      foreman+ 3019218 3019201  1 12:51 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010300.641188-3019201-245584559019420/AnsiballZ_setup.py && sleep 0'
      root     3019232 2954473  0 12:51 pts/9    00:00:00 grep --color=auto ansible
      ~~~
      

      With the default batch size (100) there's just one runner for both hosts:

      # ps -ef | grep ansible
      ~~~
      foreman+ 3021311 3021160  7 13:00 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3021160-1gekqmh -p playbook.yml
      foreman+ 3021312 3021311 21 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
      foreman+ 3021320 3021312 10 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
      foreman+ 3021331       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
      foreman+ 3021334       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
      foreman+ 3021349 3021320  0 13:00 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010837.379527-3021320-273526151698778/AnsiballZ_setup.py && sleep 0'
      root     3021362 2954473  0 13:00 pts/9    00:00:00 grep --color=auto ansible
      ~~~
      

      This is a problem because a single hung/unresponsive host can freeze the entire batch of hosts. In bigger environments with recurring jobs, the stuck jobs pile up quickly, which can lead to performance problems:

      [root@satellite tmp]# su - postgres -c "psql -d foreman -c 'select label,count(label),state,result from foreman_tasks_tasks where state <> '\''stopped'\'' group by label,state,result ORDER BY label;'"
      ~~~
                              label                         | count |   state   | result
      ------------------------------------------------------+-------+-----------+---------
      ...
      ...
       Actions::RemoteExecution::RunHostJob                 |   104 | paused    | pending
       Actions::RemoteExecution::RunHostJob                 |  1996 | running   | pending
       Actions::RemoteExecution::RunHostsJob                |     1 | paused    | error
       Actions::RemoteExecution::RunHostsJob                |     2 | paused    | pending
       Actions::RemoteExecution::RunHostsJob                |    28 | running   | pending
       Actions::RemoteExecution::RunHostsJob                |     1 | scheduled | pending
      ~~~
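
      Side note: piled-up tasks like these can be removed with the foreman_tasks:cleanup rake task. The search string, state and age below are only an example, and NOOP=true (if I remember the flag correctly) keeps it a dry run:
      ~~~
      # Dry run first; drop NOOP=true to actually delete the matching tasks.
      foreman-rake foreman_tasks:cleanup TASK_SEARCH='label = "Actions::RemoteExecution::RunHostJob"' STATES='paused' AFTER='7d' NOOP=true VERBOSE=true
      ~~~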
      

      Other than configuring batch_size=1, one can try:

      a) Add 'timeout' [0] at the task level:

      • Once the timeout is reached, the hung task fails, which allows the remaining tasks on the other hosts to continue (see the sketch below).
      • This doesn't scale well in big environments with many roles, though.
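
      For illustration, this is what the reproducer's nfs_hang task could look like with a task-level timeout; the 300-second value is an arbitrary example and the 'timeout' keyword requires ansible-core 2.10 or newer:
      ~~~
      ---
      # tasks file for nfs_hang, with a task-level timeout
      - name: Try to open a file hosted on nfs server, but give up after 5 minutes
        command: cat /nfs/imports/test/file.txt
        register: cat_output
        ignore_errors: true
        timeout: 300    # fail the task after 300 seconds instead of hanging forever
      ~~~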

      b) Use 'free' strategy [1]:

      • By default Ansible uses the 'linear' strategy; the 'free' strategy lets each host work through its tasks without waiting for the other hosts.
      • One can simply clone the 'Ansible Roles - Ansible Default' job template and add 'strategy: free' as seen below:
      ~~~
      ---
      - hosts: all
        strategy: free
        pre_tasks:
          - name: Display all parameters known for the Foreman host
            debug:
              var: foreman
            tags:
              - always
        tasks:
          - name: Apply roles
            include_role:
              name: "{{ role }}"
            tags:
              - always
            loop: "{{ foreman_ansible_roles }}"
            loop_control:
              loop_var: role
      ~~~

      • This works to a certain extent. The tasks on the responsive hosts do execute successfully, but their status in Satellite stays running/pending, so you still have to cancel them to "unstick" them. That running status is misleading; the result just never gets passed back to Satellite:
      ~~~
      TASK [Apply roles] *************************************************************

      TASK [run_check : Get current time and date] ***********************************
      ok: [jsenkyri-rhel9d.sysmgmt.lan]

      TASK [run_check : Append time and date to /tmp/ansible_runs.txt] ***************
      changed: [jsenkyri-rhel9d.sysmgmt.lan]
      ~~~
      

      ###

      I am not sure to what degree this is a problem on the Satellite side. When it comes to hung tasks, Ansible itself has certain limitations, see [2]. The part with the 'free' strategy & task status does seem to be a Satellite bug, though.

      Opening this BZ so we can check whether there's a bug or a potential RFE on the Satellite side. Any solutions/workarounds are very welcome as well.

      Note: Bug 2156532 [3] seems related.

      [0] https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task
      [1] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
      [2] https://github.com/ansible/ansible/issues/30411
      [3] https://bugzilla.redhat.com/show_bug.cgi?id=2156532

              Assignee: Unassigned
              Reporter: Jan Senkyrik (rhn-support-jsenkyri)