Satellite / SAT-19752

Ansible handling of hung jobs

      Let me explain with an example (reproducer):

      • Have Satellite 6.13
      • Have 2 content hosts - jsenkyri-rhel9c and jsenkyri-rhel9d.
      • Create 2 roles - nfs_hang and run_check.
      • The nfs_hang role emulates a hung task: it tries to read a file hosted on a mounted NFS share while the NFS service is not running, so the process hangs indefinitely (see the setup sketch after these steps).
      • The run_check role will simply add a new line to /tmp/ansible_runs.txt with date & time of the ansible run.
      • Assign role 'nfs_hang' to the jsenkyri-rhel9c host:
      ~~~
      ---
      # tasks file for nfs_hang
      - name: Try to open a file hosted on nfs server. If the nfs-service is not running then this should hang forever.
        command: cat /nfs/imports/test/file.txt
        register: cat_output
        ignore_errors: true

      - name: Print the output
        debug:
          var: cat_output.stdout_lines
        when: cat_output is defined and cat_output.rc == 0
      ~~~

      • Assign role 'run_check' to the other host jsenkyri-rhel9d:
      ~~~
      ---
      # tasks file for run_check
      - name: Get current time and date
        set_fact:
          current_time: "{{ ansible_date_time.iso8601 }}"

      - name: Append time and date to /tmp/ansible_runs.txt
        lineinfile:
          path: /tmp/ansible_runs.txt
          line: "Current time and date: {{ current_time }}"
          create: yes
      ~~~

      • Select both hosts and do 'Run all Ansible roles'. Satellite will create 2 tasks - RunHostJob for 'jsenkyri-rhel9c' and RunHostJob for 'jsenkyri-rhel9d'.
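
      For completeness, a minimal sketch of how the NFS-hang precondition can be prepared. The NFS server hostname (nfsserver.example.com) and the export setup are assumptions; only the client-side path /nfs/imports/test/file.txt comes from the nfs_hang role above:
      ~~~
      # On the NFS server (hostname and export path are assumptions):
      echo "/nfs/imports/test *(ro)" >> /etc/exports
      exportfs -ra
      systemctl start nfs-server

      # On the client jsenkyri-rhel9c: a hard mount (the NFS default) makes reads
      # block instead of erroring out when the server stops responding.
      mkdir -p /nfs/imports/test
      mount -t nfs nfsserver.example.com:/nfs/imports/test /nfs/imports/test

      # Back on the NFS server: stop the service. Reading file.txt on the client
      # now hangs indefinitely, which is exactly what the nfs_hang role triggers.
      systemctl stop nfs-server
      ~~~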

      Result:

      • Both jobs will hang forever. The tasks are left in running/pending state until you cancel them. Nothing is added to '/tmp/ansible_runs.txt' on 'jsenkyri-rhel9d'.

      Expectation:

      • Job for 'jsenkyri-rhel9c' hangs. Job for 'jsenkyri-rhel9d' finishes successfully and a new line is added to '/tmp/ansible_runs.txt'.

      ###

      If you set batch_size=1, the result is as one would expect: 'jsenkyri-rhel9c' will hang forever, while 'jsenkyri-rhel9d' will finish successfully and write a new line to '/tmp/ansible_runs.txt'. However, this impacts performance (a sketch of the setting change is below).
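
      For reference, a sketch of how the batch size can be lowered. My assumption is that batch_size corresponds to the 'Proxy tasks batch size' setting (foreman_tasks_proxy_batch_size) under Administer > Settings > Tasks:
      ~~~
      # Assumption: lower the proxy batch size to 1 for testing...
      hammer settings set --name foreman_tasks_proxy_batch_size --value 1
      # ...and revert to the default of 100 afterwards:
      hammer settings set --name foreman_tasks_proxy_batch_size --value 100
      ~~~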

      I suppose batch_size=1 does the trick because then each job has its own ansible-runner:

      # ps -ef | grep ansible
      ~~~
      foreman+ 3019108       1  0 12:49 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
      foreman+ 3019185 3018422  6 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-18j5dcu -p playbook.yml
      foreman+ 3019186 3018422  7 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-jjtqb3 -p playbook.yml
      foreman+ 3019187 3019185 30 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
      foreman+ 3019189 3019186 32 12:51 pts/1    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-jjtqb3/inventory playbook.yml
      foreman+ 3019201 3019187 10 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
      foreman+ 3019209       1  0 12:51 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
      foreman+ 3019218 3019201  1 12:51 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010300.641188-3019201-245584559019420/AnsiballZ_setup.py && sleep 0'
      root     3019232 2954473  0 12:51 pts/9    00:00:00 grep --color=auto ansible
      ~~~
      

      With the default batch size (100) there's just one runner for both hosts:

      # ps -ef | grep ansible
      ~~~
      foreman+ 3021311 3021160  7 13:00 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3021160-1gekqmh -p playbook.yml
      foreman+ 3021312 3021311 21 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
      foreman+ 3021320 3021312 10 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
      foreman+ 3021331       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
      foreman+ 3021334       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
      foreman+ 3021349 3021320  0 13:00 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010837.379527-3021320-273526151698778/AnsiballZ_setup.py && sleep 0'
      root     3021362 2954473  0 13:00 pts/9    00:00:00 grep --color=auto ansible
      ~~~
      

      This is a problem because a single hung/unresponsive host can freeze the entire batch of hosts. In bigger environments with recurring jobs, the stuck jobs pile up quickly, which can lead to performance problems:

      [root@satellite tmp]# su - postgres -c "psql -d foreman -c 'select label,count(label),state,result from foreman_tasks_tasks where state <> '\''stopped'\'' group by label,state,result ORDER BY label;'"
      ~~~
                              label                         | count |   state   | result
      ------------------------------------------------------+-------+-----------+---------
      ...
      ...
       Actions::RemoteExecution::RunHostJob                 |   104 | paused    | pending
       Actions::RemoteExecution::RunHostJob                 |  1996 | running   | pending
       Actions::RemoteExecution::RunHostsJob                |     1 | paused    | error
       Actions::RemoteExecution::RunHostsJob                |     2 | paused    | pending
       Actions::RemoteExecution::RunHostsJob                |    28 | running   | pending
       Actions::RemoteExecution::RunHostsJob                |     1 | scheduled | pending
      ~~~
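
      Side note: piled-up tasks like these can be removed with the foreman_tasks:cleanup rake task. The search string, state and age below are only an example, and NOOP=true (if I remember the flag correctly) keeps it a dry run:
      ~~~
      # Dry run first; drop NOOP=true to actually delete the matching tasks.
      foreman-rake foreman_tasks:cleanup TASK_SEARCH='label = "Actions::RemoteExecution::RunHostJob"' STATES='paused' AFTER='7d' NOOP=true VERBOSE=true
      ~~~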
      

      Other than configuring batch_size=1, one can try:

      a) Add 'timeout' [0] at the task level:

      • Once the timeout is reached, the hung task fails, which allows the remaining tasks on the other hosts to continue (see the sketch below).
      • This doesn't scale well in big environments with many roles, though.
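
      For illustration, this is what the reproducer's nfs_hang task could look like with a task-level timeout; the 300-second value is an arbitrary example and the 'timeout' keyword requires ansible-core 2.10 or newer:
      ~~~
      ---
      # tasks file for nfs_hang, with a task-level timeout
      - name: Try to open a file hosted on nfs server, but give up after 5 minutes
        command: cat /nfs/imports/test/file.txt
        register: cat_output
        ignore_errors: true
        timeout: 300    # fail the task after 300 seconds instead of hanging forever
      ~~~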

      b) Use 'free' strategy [1]:

      • By default Ansible uses the 'linear' strategy; the 'free' strategy lets each host work through its tasks without waiting for the other hosts.
      • One can simply clone the 'Ansible Roles - Ansible Default' job template and add 'strategy: free' as seen below:
      ~~~
      ---
      - hosts: all
        strategy: free
        pre_tasks:
          - name: Display all parameters known for the Foreman host
            debug:
              var: foreman
            tags:
              - always
        tasks:
          - name: Apply roles
            include_role:
              name: "{{ role }}"
            tags:
              - always
            loop: "{{ foreman_ansible_roles }}"
            loop_control:
              loop_var: role
      ~~~

      • This works to a certain extent. The tasks on the responsive hosts do execute successfully, but their status in Satellite stays running/pending, so you still have to cancel them to "unstick" them. That running status is misleading; the result just never gets passed back to Satellite:
      ~~~
      TASK [Apply roles] *************************************************************

      TASK [run_check : Get current time and date] ***********************************
      ok: [jsenkyri-rhel9d.sysmgmt.lan]

      TASK [run_check : Append time and date to /tmp/ansible_runs.txt] ***************
      changed: [jsenkyri-rhel9d.sysmgmt.lan]
      ~~~
      

      ###

      I am not sure to what degree this is a problem on the Satellite side. When it comes to hung tasks, Ansible itself has certain limitations, see [2]. The part with the 'free' strategy & task status does seem to be a Satellite bug, though.

      Opening this BZ so we can check whether there's a bug or a potential RFE on the Satellite side. Any solutions/workarounds are very welcome as well.

      Note: Bug 2156532 [3] seems related.

      [0] https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task
      [1] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
      [2] https://github.com/ansible/ansible/issues/30411
      [3] https://bugzilla.redhat.com/show_bug.cgi?id=2156532

              Assignee: Unassigned
              Reporter: Jan Senkyrik (rhn-support-jsenkyri)