Bug
Resolution: Unresolved
Normal
None
6.13.0
Let me explain with an example (reproducer):
- Have Satellite 6.13
- Have 2 content hosts - jsenkyri-rhel9c and jsenkyri-rhel9d.
- Create 2 roles - nfs_hang and run_check.
- The nfs_hang role emulates a hung task. It tries to open a file hosted on a mounted NFS share while the NFS service is not running, so the process hangs indefinitely.
- The run_check role simply appends a new line with the date & time of the Ansible run to /tmp/ansible_runs.txt.
- Assign role 'nfs_hang' to the jsenkyri-rhel9c host:
~~~
---
# tasks file for nfs_hang
- name: Try to open a file hosted on nfs server. If the nfs-service is not running then this should hang forever.
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true

- name: Print the output
  debug:
    var: cat_output.stdout_lines
  when: cat_output is defined and cat_output.rc == 0
~~~
- Assign role 'run_check' to the other host jsenkyri-rhel9d:
~~~
---
# tasks file for run_check
- name: Get current time and date
  set_fact:
    current_time: "{{ ansible_date_time.iso8601 }}"

- name: Append time and date to /tmp/ansible_runs.txt
  lineinfile:
    path: /tmp/ansible_runs.txt
    line: "Current time and date: {{ current_time }}"
    create: yes
~~~
- Select both hosts and do 'Run all Ansible roles'. Satellite will create 2 tasks - RunHostJob for 'jsenkyri-rhel9c' and RunHostJob for 'jsenkyri-rhel9d'.
Result:
- Both jobs will hang forever. The tasks are left in running/pending state until you cancel them. Nothing is added to '/tmp/ansible_runs.txt' on 'jsenkyri-rhel9d'.
Expectation:
- Job for 'jsenkyri-rhel9c' hangs. Job for 'jsenkyri-rhel9d' finishes successfully and a new line is added to '/tmp/ansible_runs.txt'.
###
If you set batch_size=1, the result is as one would expect: 'jsenkyri-rhel9c' hangs forever, while 'jsenkyri-rhel9d' finishes successfully and writes a new line to '/tmp/ansible_runs.txt'. However, this impacts performance.
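For reference, a minimal, hedged sketch of lowering the batch size via hammer, assuming the relevant setting is exposed as 'foreman_tasks_proxy_batch_size' (the exact setting name and its location under Administer > Settings may differ between versions):
~~~
# Setting name is an assumption; verify it under Administer > Settings > Tasks first.
hammer settings set --name foreman_tasks_proxy_batch_size --value 1
~~~
This trades throughput for isolation, which is exactly the performance impact mentioned above.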
I suppose batch_size=1 does the trick because each job then gets its own ansible-runner:
# ps -ef | grep ansible
~~~
foreman+ 3019108       1  0 12:49 ?     00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3019185 3018422  6 12:51 ?     00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-18j5dcu -p playbook.yml
foreman+ 3019186 3018422  7 12:51 ?     00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-jjtqb3 -p playbook.yml
foreman+ 3019187 3019185 30 12:51 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019189 3019186 32 12:51 pts/1 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-jjtqb3/inventory playbook.yml
foreman+ 3019201 3019187 10 12:51 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019209       1  0 12:51 ?     00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3019218 3019201  1 12:51 pts/0 00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010300.641188-3019201-245584559019420/AnsiballZ_setup.py && sleep 0'
root     3019232 2954473  0 12:51 pts/9 00:00:00 grep --color=auto ansible
~~~
With the default batch size (100) there's just one runner for both hosts:
# ps -ef | grep ansible
~~~
foreman+ 3021311 3021160  7 13:00 ?     00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3021160-1gekqmh -p playbook.yml
foreman+ 3021312 3021311 21 13:00 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021320 3021312 10 13:00 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021331       1  0 13:00 ?     00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3021334       1  0 13:00 ?     00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3021349 3021320  0 13:00 pts/0 00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010837.379527-3021320-273526151698778/AnsiballZ_setup.py && sleep 0'
root     3021362 2954473  0 13:00 pts/9 00:00:00 grep --color=auto ansible
~~~
This is a problem because a single hung/unresponsive host can freeze the entire batch of hosts. In bigger environments with recurring jobs, the stuck jobs pile up quite quickly, which can lead to performance problems:
[root@satellite tmp]# su - postgres -c "psql -d foreman -c 'select label,count(label),state,result from foreman_tasks_tasks where state <> '\''stopped'\'' group by label,state,result ORDER BY label;'"
label | count | state | result
------------------------------------------------------+-------+-----------+---------
...
...
Actions::RemoteExecution::RunHostJob | 104 | paused | pending
Actions::RemoteExecution::RunHostJob | 1996 | running | pending
Actions::RemoteExecution::RunHostsJob | 1 | paused | error
Actions::RemoteExecution::RunHostsJob | 2 | paused | pending
Actions::RemoteExecution::RunHostsJob | 28 | running | pending
Actions::RemoteExecution::RunHostsJob | 1 | scheduled | pending
Other than configuring batch_size=1, one can try:
a) Add a 'timeout' [0] at the task level (a minimal sketch follows below):
- Once the timeout is reached, the hung task is failed, which allows the remaining tasks on other hosts to continue.
- This doesn't scale well in big environments with many roles.
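For illustration, a sketch of option a) applied to the nfs_hang role from the reproducer. The 600-second value is just an example, and the 'timeout' task keyword requires Ansible 2.10 or newer:
~~~
---
# tasks file for nfs_hang, with a per-task timeout added (example value)
- name: Try to open a file hosted on nfs server
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true
  timeout: 600   # fail the task after 600 seconds instead of hanging forever
~~~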
b) Use 'free' strategy [1]:
- By default Ansible uses the 'linear' strategy; the 'free' strategy lets each host run through its tasks without waiting for the other hosts.
- One can simply clone the 'Ansible Roles - Ansible Default' job template and add 'strategy: free' as seen below:
~~~
---
- hosts: all
  strategy: free
  pre_tasks:
    - name: Display all parameters known for the Foreman host
      debug:
        var: foreman
      tags:
        - always
  tasks:
    - name: Apply roles
      include_role:
        name: "{{ role }}"
      tags:
        - always
      loop: "{{ foreman_ansible_roles }}"
      loop_control:
        loop_var: role
~~~
- This works to a certain extent. The task status in Satellite stays running/pending, so you have to cancel the tasks to "unstick" them. However, that status is misleading: the tasks actually execute successfully, the result just never gets passed back to Satellite:
~~~
TASK [Apply roles] *************************************************************

TASK [run_check : Get current time and date] ***********************************
ok: [jsenkyri-rhel9d.sysmgmt.lan]

TASK [run_check : Append time and date to /tmp/ansible_runs.txt] ***************
changed: [jsenkyri-rhel9d.sysmgmt.lan]
~~~
###
I am not sure to what degree this is a problem on the Satellite side. When it comes to hung tasks, Ansible itself has certain limitations, see [2]. The part about the 'free' strategy & task status does seem to be a Satellite bug, though.
Opening this BZ so we can check whether there's any bug or potential RFE on the Satellite side. Any solutions/workarounds are very welcome as well.
Note: Bug 2156532 [3] seems related.
[0] https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task
[1] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
[2] https://github.com/ansible/ansible/issues/30411
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2156532
- relates to SAT-21428: Misleading job invocation details when running ansible roles in bulk (New)