OpenShift Bugs / OCPBUGS-7832

Occasionally an entire CGU will fail to upgrade with BackupTimeout while upgrading many clusters at scale

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.12.z
    • TALM Operator
    • No
    • False

      This is a clone of issue OCPBUGS-7422. The following is the description of the original issue:

      Description of problem:

      While upgrading 3451 SNOs with the upgrade split across 4 CGUs (SNO counts per CGU: 1000, 1000, 1000, 451), the last CGU hit a condition in which none of its 451 SNOs performed the upgrade because all of them showed a status of BackupTimeout. Later inspection of the SNOs revealed that all of them had actually completed the backup job. Based on the timestamps and logs, it seems that, depending on a CGU's timeout and on when each CGU is enabled, TALM can spend a lengthy period reconciling a previous CGU, which prevents the next CGU from performing its backup.
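
      The suspected interaction can be illustrated with a toy model. This is only a sketch under stated assumptions, not TALM's actual code: it assumes a single worker that handles CGUs one at a time, a backup clock that starts when the CGU is enabled, and a backup result that is only observed once the worker reaches that CGU. The names `CGU`, `reconcile_cost`, `backup_statuses`, the "Succeeded" label, and all minute values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CGU:
    name: str
    enabled_at: int      # minute at which the CGU was enabled
    backup_done_at: int  # minute at which the backup jobs really finished on the SNOs
    reconcile_cost: int  # minutes the worker stays busy on this CGU (hypothetical)


def backup_statuses(cgus, backup_timeout):
    """Toy serial model: a backup result is seen only when the single worker gets to the CGU."""
    worker_free_at = 0
    statuses = {}
    for cgu in sorted(cgus, key=lambda c: c.enabled_at):
        # The worker cannot observe the result before it is free,
        # and not before the backup has actually finished.
        observed_at = max(worker_free_at, cgu.backup_done_at)
        # Assumption: the timeout is measured from the moment the CGU was enabled.
        if observed_at - cgu.enabled_at > backup_timeout:
            statuses[cgu.name] = "BackupTimeout"
        else:
            statuses[cgu.name] = "Succeeded"  # hypothetical label for the success case
        worker_free_at = observed_at + cgu.reconcile_cost
    return statuses


# A long reconcile on the previous CGU delays the observation of the next one.
slow = [
    CGU("cgu-3", enabled_at=0, backup_done_at=10, reconcile_cost=120),
    CGU("cgu-4", enabled_at=20, backup_done_at=30, reconcile_cost=5),
]
# Same schedule, but the previous CGU is reconciled quickly.
fast = [
    CGU("cgu-3", enabled_at=0, backup_done_at=10, reconcile_cost=5),
    CGU("cgu-4", enabled_at=20, backup_done_at=30, reconcile_cost=5),
]

slow_result = backup_statuses(slow, backup_timeout=60)
fast_result = backup_statuses(fast, backup_timeout=60)
print(slow_result)
print(fast_result)
```

      In this model the second CGU is reported as BackupTimeout in the slow case even though its backup finished well inside the timeout, which matches the symptom described above. Whether TALM really behaves this way would need to be confirmed against the operator logs.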

      Version-Release number of selected component (if applicable):

      ACM - 2.7.0-DOWNSTREAM-2023-01-26-20-15-10
      Hub OCP 4.12.1
      SNO OCP 4.11.24 upgrading to 4.12.1

      How reproducible:

      Depends on whether the number of clusters to upgrade per CGU is large enough, on when the CGUs are enabled for upgrade, and on whether backup is enabled.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            Jun Chen (jche@redhat.com)
            OpenShift Prow Bot (openshift-crt-jira-prow)
            Alex Krzos