SAT-9818

Restarting postgres just before task finish causes discrepancy between foreman and dynflow task status - forever

      Description of problem:
      When the postgres service is restarted (e.g. as part of a restart of all services, or alone) just as dynflow is about to complete a task, the task can end up hung in one of a few invalid situations forever.

      "Invalid situation" means e.g.:

      • foreman sees the task as stopped/pending while dynflow sees it as stopped/success
      • or foreman sees the task as running/pending while dynflow sees it as stopped/success

      "Forever" means there is no user action to fix the status, like:

      • a services restart doesn't help
      • a force unlock can move the foreman task from running/pending to stopped/pending, but nothing else

      Also, until a force unlock is done, such a stuck task can keep holding the lock(s) it acquired on its object(s).
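
      To see the discrepancy directly, the two states can be compared in the database. A minimal sketch, assuming the usual foreman-tasks/dynflow layout where foreman_tasks_tasks.external_id holds the dynflow execution plan uuid (verify the table/column names on your version before relying on this):

      -------8<--------------8<--------------8<-------
      # list tasks whose foreman state/result disagrees with dynflow's
      su - postgres -c "psql foreman -c \"
      select t.id, t.label, t.state as foreman_state, t.result as foreman_result,
             ep.state as dynflow_state, ep.result as dynflow_result
      from foreman_tasks_tasks t
      join dynflow_execution_plans ep on ep.uuid::varchar = t.external_id
      where t.state <> ep.state or t.result <> ep.result;\""
      -------8<--------------8<--------------8<-------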

      Version-Release number of selected component (if applicable):
      Satellite 6.10.4

      How reproducible:
      100% within a few attempts

      Steps to Reproduce:
      One particular reproducer is to destroy a CV and, just at the end, restart the postgres service. It can be VERY tricky to hit "the end" by guessing, so the script below instead watches the number of completed pulp tasks: for a CV with one repo, the ContentView::Destroy task triggers one pulp task. So whenever the script detects as many new completed pulp tasks as there are CVs being destroyed, it restarts postgres.

      Script itself:

      -------8<--------------8<--------------8<-------
      #!/bin/bash
      # Usage: ./create_delete_cv_restart_postgres.sh [CONCURRENCY] [REPOIDS]
      CONCUR=${1:-5}
      REPOIDS=${2:-51}
      hmr="hammer shell"

      # create a CV, publish it, and delete everything but the CV record itself
      prepare_cv_to_delete() {
          CVID=$1
          ( echo "content-view create --organization-id=1 --name cv_zoos_${CVID} --repository-ids ${REPOIDS}"
            echo "content-view publish --organization-id=1 --name cv_zoos_${CVID}"
            echo "content-view remove-from-environment --organization-id=1 --name=cv_zoos_${CVID} --lifecycle-environment-id=1"
            echo "content-view version delete --content-view=cv_zoos_${CVID} --version 1.0 --organization-id 1"
          ) | $hmr
      }

      for i in $(seq 1 $CONCUR); do
          prepare_cv_to_delete $i &
      done

      echo "waiting for CVs create+almost-delete"
      time wait

      for i in $(seq 1 $CONCUR); do
          hammer content-view delete --name=cv_zoos_${i} --organization-id 1 &
      done

      echo "$(date): waiting for CVs delete"
      tasks=$(su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\"")
      echo "$(date): waiting for CVs delete, pulp tasks=${tasks}"
      expected=$((tasks+CONCUR))
      tasks=0
      # poll until CONCUR new pulp tasks exist, i.e. the deletes are about to finish
      while [ $tasks -lt $expected ]; do
          tasks=$(su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\"")
          sleep 0.5
      done
      #su - postgres -c "psql pulpcore -c \"select count(*) from core_task;\""
      echo "$(date): restarting postgres as having tasks=${tasks}"
      systemctl restart rh-postgresql12-postgresql.service
      date
      time wait
      su - postgres -c "psql pulpcore -c \"select count(*) from core_task;\""
      -------8<--------------8<--------------8<-------

      Usage:

      ./create_delete_cv_restart_postgres.sh 5 REPOID

      where REPOID is the id of a small repo
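
      A small custom repository keeps the pulp tasks short enough for the timing trick to work. To pick a candidate id (plain hammer listing; adjust the organization id as needed):

      -------8<--------------8<--------------8<-------
      # pick the id of a small repo from the listing
      hammer repository list --organization-id 1
      -------8<--------------8<--------------8<-------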

      Actual results:
      Random tasks get stuck forever, sometimes still holding their acquired locks.

      As an example, see the attached task export.
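
      A comparable export can be generated with the stock foreman-tasks rake task; a sketch (the TASK_SEARCH filter shown is an assumption, widen or narrow it as needed):

      -------8<--------------8<--------------8<-------
      # export the affected tasks for inspection; the path of the resulting tarball is printed
      foreman-rake foreman_tasks:export_tasks TASK_SEARCH='label = Actions::Katello::ContentView::Destroy'
      -------8<--------------8<--------------8<-------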

      Expected results:
      No tasks stuck forever; tasks should be recoverable by a services restart or by a manual (Skip &) Resume.
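
      For reference, the recovery attempts listed above look like this (a sketch; per this report neither changes the stuck status, and force unlock is only offered on the task page in the web UI):

      -------8<--------------8<--------------8<-------
      # full services restart - does not help
      satellite-maintain service restart
      # resume paused tasks - does not pick up the stuck ones
      hammer task resume
      -------8<--------------8<--------------8<-------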

      Additional info:
