SAT-30941: Capsule sync or a Satellite repo sync gets stuck forever after upgrade to 6.16



      Description of problem:
      Under specific conditions, an upgrade to 6.16 can leave the pulp tasking system with a hung "running" task. That task blocks either the whole Capsule sync (if the problem happened on a Capsule) or the sync of a particular repo on the Satellite (if the problem happened on the Satellite).

      Particular scenario:

      1) a pulp task is started before the migration https://github.com/pulp/pulpcore/blob/main/pulpcore/app/migrations/0117_task_unblocked_at.py#L12-L17 is applied
      2) pulp services are stopped but the task remains in state='running' (this is tricky, but it can happen) - if pulp were started again, it would detect the task whose worker is gone and mark it failed
      3) the migration sets `unblocked_at=null` for all existing tasks, including the running one
      4) the upgraded pulp code considers only unblocked tasks (https://github.com/pulp/pulpcore/blob/main/pulpcore/tasking/worker.py#L314) for further execution or cancellation
      5) as an outcome, we have a "running" task with no worker and a nulled `unblocked_at`, but still holding its shared resources - so further tasks needing the same resources hang waiting on it, forever (see the query sketch below this list)
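
      To illustrate step 5: once the UUID of the hung task is known, a query along these lines can list the other tasks stuck waiting on the same resources. This is an untested sketch; it assumes the default pulpcore schema where `core_task.reserved_resources_record` is a text array, and '<hung task UUID>' is a placeholder:

      su - postgres -c "psql pulpcore -c \"
          SELECT w.pulp_id, w.name, w.state, w.pulp_created
            FROM core_task w, core_task h
           WHERE h.pulp_id = '<hung task UUID>'
             AND w.pulp_id <> h.pulp_id
             AND w.state = 'waiting'
             AND w.reserved_resources_record && h.reserved_resources_record;\""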

      When this happens on the Satellite during a repo sync, any further synchronization of that repo is blocked.

      When this happens on a Capsule, any subsequent Capsule sync (attempting to sync the same repo as the hung synchronization) will hang in the initial RefreshRepos step, which in practice means the whole Capsule sync is hung.

      How reproducible:
      (uncertain - one step of the reproducer, leaving a task in state='running' across the pulp shutdown, is hard to trigger deterministically)
       

      Is this issue a regression from an earlier version:
      Probably yes (as an outcome of the upgrade, syncing can end up hung)
       

      Steps to Reproduce:

      1. Have a Capsule 6.15 and invoke a bigger Capsule sync.

      2. When the sync is in progress, upgrade the Capsule to 6.16. There is a chance pulp shutdown will leave a task in `state=running` (this can be tricky to reproduce, but it can happen).

      3. After the upgrade, check if there is a "running" task that has (a combined query sketch follows this list):

      • state='running'
      • empty `worker_id` (no worker is assigned to the running task)
      • empty `unblocked_at` timestamp
      • a start time before the upgrade (if unsure when the upgrade was run, su - postgres -c "psql pulpcore -c \"SELECT applied FROM django_migrations WHERE name = '0117_task_unblocked_at';\"" will show the precise timestamp; the task must have been started prior to this timestamp)
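
      A query like the following should list such tasks in one shot (an untested sketch, assuming the default pulpcore schema with the `core_task` table):

      su - postgres -c "psql pulpcore -c \"
          SELECT pulp_id, name, state, started_at
            FROM core_task
           WHERE state = 'running'
             AND worker_id IS NULL
             AND unblocked_at IS NULL
             AND started_at < (SELECT applied FROM django_migrations
                                WHERE name = '0117_task_unblocked_at');\""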

      4. Try a new Capsule sync that will forcefully sync the same repos (see the example command below).
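
      For example (a sketch; <capsule id> is a placeholder, and the --skip-metadata-check option - which forces a re-sync of already synced repos - may or may not be available depending on the hammer version):

      hammer capsule content synchronize --id <capsule id> --skip-metadata-check true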

      Actual behavior:
      Step 4 gets stuck forever in RefreshRepos; the pulp general update task keeps waiting (on a resource still held by the hung "running" task).

      Expected behavior:
      3. No such hung task exists.
      4. The Capsule sync does not hang.

      Business Impact / Additional info:
      A very specific bug, rare to hit, and a one-time event during the upgrade to 6.16. BUT it comes with bad user experience, it is hard to troubleshoot or identify, and it is a generic pulp bug (not related to Sat/Caps only).

      I am not sure if or how to prevent this - should a migration step be added that cancels running tasks left with an empty `unblocked_at` timestamp? Or is "just" https://access.redhat.com/solutions/7104341 a sufficient reaction? A rough idea of such a cleanup is sketched below.
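
      For illustration only - an untested sketch of what such a cleanup could do (whether as an extra migration step or as a manual workaround); it simply moves the orphaned "running" tasks to a final state so they stop holding resources, and it assumes the default pulpcore schema:

      su - postgres -c "psql pulpcore -c \"
          UPDATE core_task
             SET state = 'canceled', finished_at = NOW()
           WHERE state = 'running'
             AND worker_id IS NULL
             AND unblocked_at IS NULL;\""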
