Automation Hub / AAH-1395

New RepositoryVersion for published repo had 0 content

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • cloud-2022-03-09

      Issue: On 4/16 on c.rh.c, an action that created a new pulp RepositoryVersion for the `published` Repository did not copy content from previous RepositoryVersions, so the new version had 0 content. Content was manually copied back into the `published` Repository, but the root cause has not been determined.

      Mitigations:

      • We increased `retain_repo_versions` on the `published` Repository from 1 to 10000. (Edit 2/22: we updated every AnsibleRepository, including `published`, to `retain_repo_versions` = 50, so they are all protected.) If this happens again, we can point the pulp `published` Distribution at an accurate RepositoryVersion.
      • We temporarily reduced the use of tasks by asking PE not to upload/approve/delete, and turned off the ability for users to click the synclist toggle: https://github.com/ansible/galaxy_ng/pull/1138
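
      The retention mitigation can be illustrated with a toy in-memory model. This is only a sketch and not pulpcore code: `ToyRepository` and `last_good_version` are hypothetical names, and the model assumes `retain_repo_versions` is at least 1 and that the oldest versions are dropped first.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepository:
    """Hypothetical stand-in for a repository; not the pulpcore model."""
    retain_repo_versions: int
    versions: dict = field(default_factory=dict)  # version number -> set of content
    latest: int = 0

    def new_version(self, content):
        self.latest += 1
        self.versions[self.latest] = set(content)
        # Keep only the newest `retain_repo_versions` versions.
        for number in sorted(self.versions)[:-self.retain_repo_versions]:
            del self.versions[number]
        return self.latest


def last_good_version(repo):
    """Return the newest retained version that still has content, or None."""
    for number in sorted(repo.versions, reverse=True):
        if repo.versions[number]:
            return number
    return None


# With retain_repo_versions = 1, an empty new version leaves nothing to fall back to.
fragile = ToyRepository(retain_repo_versions=1)
fragile.new_version({"a", "b"})
fragile.new_version(set())
fragile_fallback = last_good_version(fragile)

# With retain_repo_versions = 50, the previous version survives and a
# distribution could be pointed back at it.
protected = ToyRepository(retain_repo_versions=50)
protected.new_version({"a", "b"})
protected.new_version(set())
protected_fallback = last_good_version(protected)
```

      In this model `fragile_fallback` is `None`, while `protected_fallback` is version 1, which is the kind of version the `published` Distribution could be repointed to.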

      Possible root causes:

      • We updated our move task in the 4/16 deploy (the change was reviewed and found to be proper)
      • https://issues.redhat.com/browse/AAH-1384: orphan_protection_time set to 0 can cause race conditions
      • The Pulp forum reported an upgrade issue from 3.15 to 3.17, the same upgrade we did on 4/16
      • The Stage env applied migration upgrades incrementally; Prod applied them all at once
      • The curate task locks only the synclist repo, not the upstream_repo used as the base repo. “maybe in the meantime the repoversion were cleaned up from the upstream_repo and the base_version you're hoping to find is gone” (Update 2/23: logs confirm that this situation occurred, after which 7 workers died at once, about 2 hours before the outage was reported. Prevention PR: https://github.com/ansible/galaxy_ng/pull/1141)
      • The Collection deletion endpoints are worth reviewing
      • cleanup_old_versions should not count unfinished versions. (This is usually protected by only manipulating a repo in a task, so it should never encounter unfinished versions.)
      • Error on GET to v3/namespaces: http://pastebin.test.redhat.com/1030923 (this may be caused by is_org_admin always being set to false in the mitigation PR, which may cause synclists_owned_by_group to come back false)
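
      The curate race and the cleanup rule above can be sketched with a toy model. This is an illustration under assumptions, not galaxy_ng or pulpcore code: `ToyRepo`, `curate`, and `may_overlap` are hypothetical names, and the resource check only approximates how a tasking system keeps tasks that reserve the same resource from running together.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Hypothetical stand-in for a repository; not the pulpcore model."""
    name: str
    retain: int
    versions: dict = field(default_factory=dict)  # number -> (content, complete)

    def add_version(self, number, content, complete=True):
        self.versions[number] = (frozenset(content), complete)

    def cleanup_old_versions(self):
        # Count only finished versions toward the retention limit.
        finished = sorted(n for n, (_, done) in self.versions.items() if done)
        excess = len(finished) - self.retain
        for number in finished[:max(excess, 0)]:
            del self.versions[number]


def curate(upstream, base_number, excluded):
    """Build synclist content from a base version looked up by number."""
    if base_number not in upstream.versions:
        raise LookupError(f"base version {base_number} is gone")
    content, _ = upstream.versions[base_number]
    return content - frozenset(excluded)


def may_overlap(reserved_by_running_task, wanted_by_new_task):
    """Two tasks may run at once only if their reserved resources are disjoint."""
    return not (set(reserved_by_running_task) & set(wanted_by_new_task))


published = ToyRepo("published", retain=1)
published.add_version(1, {"a", "b", "c"})
base_number = 1  # curate picks its base version, then is delayed

published.add_version(2, {"a", "b", "c", "d"})
published.cleanup_old_versions()  # retain=1 drops version 1

try:
    curate(published, base_number, excluded={"c"})
    base_lost = False
except LookupError:
    base_lost = True

# Reserving only the synclist repo lets a task on `published` run concurrently;
# reserving both repos does not.
lock_synclist_only = may_overlap({"synclist"}, {"published"})
lock_both = may_overlap({"synclist", "published"}, {"published"})
```

      In this model the base version disappears between selection and use when only the synclist repo is reserved. Reserving the upstream repo as well would keep the cleanup from interleaving with the curate task.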

              awcrosby5 Andrew Crosby (Inactive)
              awcrosby5 Andrew Crosby (Inactive)
              Clara Spealman Clara Spealman (Inactive)