  1. Satellite
  2. SAT-25061

Restarting dynflow-sidekiq@orchestrator.service when raising a new task can lead to a hung "running" task


    • Moderate
    • No

      Description of problem:
      When a task is triggered or created while the dynflow orchestrator is restarting, the task can end up hung. Several combinations of foreman task state/result and dynflow state/result can occur, for example:

      • foreman task planning/pending, dynflow planning/pending
      • foreman task planned/pending, dynflow planned/pending
      • foreman task running/pending, dynflow paused/success
      • foreman task running/error, dynflow paused/success (with error "Could not transition step from pending to running, step already in running.")
      • foreman task running/pending, dynflow running/error (stuck forever, seen at customer on 6.13, error = "Could not transition step from pending to running, step already in running."; this might be fixed in 6.15)

      Version-Release number of selected component (if applicable):
      Sat 6.13 and also 6.15 (versions I tested)

      How reproducible:
      25% in one test round

      Steps to Reproduce:
      Basic idea: trigger Actions::Katello::Applicability::Hosts::BulkGenerate tasks in a tight loop and restart the service in the meantime.
      1. Have a few dozen Content Hosts. To populate fake Content Hosts, register one host and then run the following a few dozen times on it:

      SHORTNAME=fill-current-shortname
      DOMAINNAME=fill-current-domain-name
      AK=fill-your-activation-key
      ORG=fill-your-organization

      uuid=$(uuidgen)
      echo "{\"dmi.system.uuid\": \"${uuid}\"}" > /etc/rhsm/facts/uuid.facts
      hostnamectl set-hostname ${SHORTNAME}.${uuid%%-*}.${DOMAINNAME}
      subscription-manager clean
      subscription-manager register --activationkey ${AK} --org ${ORG}
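
      The registration commands above can be wrapped in a loop instead of being rerun by hand. This is a minimal sketch, assuming the SHORTNAME, DOMAINNAME, AK and ORG variables are already set as shown; the helper names fake_fqdn and register_fake_hosts and the default count of 30 are illustrative and not part of the original reproducer.

```shell
# Sketch only: helper names and the default count are illustrative.

# Build the fake FQDN: <shortname>.<first group of the UUID>.<domain>
fake_fqdn() {
    # $1 = short name, $2 = UUID, $3 = domain name
    printf '%s.%s.%s\n' "$1" "${2%%-*}" "$3"
}

# Register COUNT fake Content Hosts from one real machine (run as root).
# Uses the same commands as the manual steps, once per iteration.
register_fake_hosts() {
    count=${1:-30}
    i=0
    while [ "$i" -lt "$count" ]; do
        uuid=$(uuidgen)
        printf '{"dmi.system.uuid": "%s"}\n' "$uuid" > /etc/rhsm/facts/uuid.facts
        hostnamectl set-hostname "$(fake_fqdn "$SHORTNAME" "$uuid" "$DOMAINNAME")"
        subscription-manager clean
        subscription-manager register --activationkey "$AK" --org "$ORG"
        i=$((i + 1))
    done
}
```

      Calling register_fake_hosts 25 after setting the four variables would then create 25 fake hosts in one go.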

      2. Once you have 20-30 Content Hosts (so that one BulkGenerate over all of them takes 1-3 seconds), run the following in 2-3 shells concurrently:

      while true; do echo "ForemanTasks.async_task(::Actions::Katello::Applicability::Hosts::BulkGenerate, host_ids: Host.pluck(:id))"; sleep 0.02; done | foreman-rake console

      3. Once the foreman-rake shells start firing new ForemanTasks, restart the service:

      systemctl restart dynflow-sidekiq@orchestrator.service

      4. Wait until the service is restarted, then stop the foreman-rake commands.

      5. Monitor tasks in the WebUI. A few hundred "Bulk generate applicability for hosts" tasks will appear, moving from planning to planned to running to stopped. Wait until no task has changed state for some time.

      6. Check whether any BulkGenerate task is stuck in one of the wrong states. If so, compare its state/result with the state/result of its dynflow task.

      7. If no such task exists, go back to step 2.
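
      The check in step 6 can be scripted once the task states have been exported. This is a minimal sketch under a stated assumption: the input is whitespace-separated lines of the form "task_id foreman_state foreman_result dynflow_state dynflow_result", a hypothetical export format (how the states are exported is not shown here). Since step 5 waits until nothing changes any more, the sketch simply reports every task whose foreman state is not "stopped"; the function name find_hung_tasks is illustrative.

```shell
# Sketch only: the input format and the function name are assumptions.
# Reads lines "task_id foreman_state foreman_result dynflow_state dynflow_result"
# on stdin and prints every task that has not reached the "stopped" state,
# with its foreman and dynflow state/result side by side for comparison.
find_hung_tasks() {
    awk '$2 != "stopped" {
        printf "%s foreman=%s/%s dynflow=%s/%s\n", $1, $2, $3, $4, $5
    }'
}
```

      For example, feeding it a line such as "b2 running pending paused success" would report that task, while a line with foreman state "stopped" is skipped.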

      Actual results:
      6. With some patience and proper tuning of the reproducer (e.g. ensure the rake console fires new tasks at a high cadence; it may also be worth having multiple --katello-hosts-queue-workers), you will see a BulkGenerate task hung in running forever.

      Expected results:
      6. no hung task.

      Additional info:

            Unassigned Unassigned
            rhn-support-pmoravec Pavel Moravec