OpenShift Bugs / OCPBUGS-7422

Occasionally an entire CGU will fail to upgrade with BackupTimeout while upgrading many clusters at scale


    • None
    • CNF RAN Sprint 232, CNF RAN Sprint 233, CNF RAN Sprint 234, CNF RAN Sprint 235, CNF RAN Sprint 236
    • 5
    • False

      Description of problem:

      While upgrading 3451 SNOs with the upgrade split across 4 CGUs (SNO counts per CGU: 1000, 1000, 1000, 451), the last CGU hit a condition where none of its 451 SNOs performed the upgrade because all of them showed a status of BackupTimeout. Later inspection of the SNOs revealed that all of them had actually completed the backup job. Based on the timestamps and logs, it seems that, depending on a CGU's timeout and on when each CGU is enabled, TALM can spend a lengthy period reconciling a previous CGU, which prevents the next CGU from performing its backup.
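
      The suspected timing condition can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not TALM's actual reconcile logic: it assumes a single worker that handles CGUs one at a time, a backup timeout counted from the moment a CGU is enabled, and a status decided only when the worker first reaches that CGU. The `Cgu` class, the `simulate` function, and all minute values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cgu:
    name: str
    enabled_at: int      # minute at which the CGU was enabled (illustrative)
    backup_timeout: int  # minutes allowed for backup, counted from enabled_at (assumption)
    reconcile_cost: int  # minutes the worker spends reconciling this CGU (assumption)


def simulate(cgus):
    """Toy model: one worker reconciles CGUs strictly in list order.

    A CGU is marked BackupTimeout when the worker first reaches it later
    than backup_timeout minutes after it was enabled, regardless of
    whether the backup jobs on the clusters finished in the meantime.
    """
    clock = 0
    statuses = {}
    for cgu in cgus:
        # The worker cannot start on a CGU before it is enabled.
        start = max(clock, cgu.enabled_at)
        waited = start - cgu.enabled_at
        if waited > cgu.backup_timeout:
            statuses[cgu.name] = "BackupTimeout"
        else:
            statuses[cgu.name] = "BackupCompleted"
        # The worker stays busy with this CGU for reconcile_cost minutes.
        clock = start + cgu.reconcile_cost
    return statuses


if __name__ == "__main__":
    # cgu-b is enabled while the worker is still busy with cgu-a.
    print(simulate([
        Cgu("cgu-a", enabled_at=0, backup_timeout=240, reconcile_cost=300),
        Cgu("cgu-b", enabled_at=10, backup_timeout=240, reconcile_cost=30),
    ]))
```

      In this model, enabling the second CGU later (for example at minute 100 rather than 10) keeps its wait under the timeout. That is consistent with the observation that the failure depends on when each CGU is enabled.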

      Version-Release number of selected component (if applicable):

      ACM - 2.7.0-DOWNSTREAM-2023-01-26-20-15-10
      Hub OCP 4.12.1
      SNO OCP 4.11.24 upgrading to 4.12.1

      How reproducible:

      Depends on whether the number of clusters per CGU is large enough, on when the CGUs are enabled for upgrade, and on whether backup is enabled.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

        1. 0.log.gz
          2.19 MB
        2. 0.log.20230213-123808.gz
          2.66 MB
        3. 0.log.20230213-080233.gz
          2.62 MB
        4. 0.log.20230213-102020.gz
          2.63 MB
        5. 0.log.20230213-054448.gz
          3.08 MB
        6. backup.jobs
          17 kB
        7. zpu.cgus.yaml
          597 kB

              jche@redhat.com Jun Chen
              akrzos@redhat.com Alex Krzos
              Alex Krzos
