OpenShift Bugs / OCPBUGS-7422

Occasionally an entire CGU will fail to upgrade with BackupTimeout while upgrading many clusters at scale


    • None
    • CNF RAN Sprint 232, CNF RAN Sprint 233, CNF RAN Sprint 234, CNF RAN Sprint 235, CNF RAN Sprint 236
    • 5
    • False



      Description of problem:

      While upgrading 3451 SNOs with the upgrade split across 4 CGUs (SNO counts per CGU: 1000, 1000, 1000, 451), none of the 451 SNOs in the last CGU performed the upgrade because all of them showed a status of BackupTimeout. Later inspection of the SNOs revealed that all of them had actually completed the backup job. Based on the timestamps and logs, it seems that, depending on a CGU's timeout and on when each CGU is enabled, TALM can spend a lengthy period reconciling a previous CGU, which prevents the next CGU from performing its backup.
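To make the suspected timing interaction concrete, here is a minimal sketch in Python. It is a toy model, not TALM's actual code. It assumes (1) a single reconciler that handles CGUs one after another, and (2) a backup deadline counted from the moment a CGU is enabled. The `Cgu` class, the `simulate` function, all field names, the minute values, and the `BackupCompleted` label are hypothetical and chosen only for illustration. Only the `BackupTimeout` status comes from this report.

```python
from dataclasses import dataclass


@dataclass
class Cgu:
    name: str
    enabled_at: int         # minute at which the CGU is enabled (hypothetical)
    reconcile_minutes: int  # time the reconciler stays busy on this CGU
    backup_timeout: int     # minutes allowed for backup, counted from enabled_at


def simulate(cgus, backup_job_minutes):
    """Toy model: one serialized reconciler processes CGUs in list order."""
    controller_free_at = 0
    statuses = {}
    for cgu in cgus:
        # The reconciler reaches this CGU only once the CGU is enabled
        # AND the previous CGU's reconcile has finished.
        first_seen_at = max(cgu.enabled_at, controller_free_at)
        # Assumption: the backup job starts when the reconciler first
        # handles the CGU, and always finishes on the cluster itself.
        job_done_at = first_seen_at + backup_job_minutes
        # Assumption: the deadline is anchored to the enable time, not to
        # the time the reconciler actually got to the CGU.
        deadline = cgu.enabled_at + cgu.backup_timeout
        if job_done_at > deadline:
            statuses[cgu.name] = "BackupTimeout"
        else:
            statuses[cgu.name] = "BackupCompleted"  # illustrative label
        controller_free_at = first_seen_at + cgu.reconcile_minutes
    return statuses


# Previous CGU keeps the reconciler busy for 300 minutes; the next CGU is
# enabled at minute 30 with a 60-minute backup timeout.
slow = simulate([Cgu("cgu-a", 0, 300, 60), Cgu("cgu-b", 30, 20, 60)], 10)

# Same schedule, but the previous CGU reconciles quickly.
fast = simulate([Cgu("cgu-a", 0, 20, 60), Cgu("cgu-b", 30, 20, 60)], 10)

print(slow)
print(fast)
```

In this model the second CGU is reported as `BackupTimeout` only in the slow case, even though its backup job runs to completion in both cases. That matches the observation that the SNOs had finished their backups. Whether TALM anchors the timeout this way would need to be confirmed against the attached logs.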

      Version-Release number of selected component (if applicable):

      ACM - 2.7.0-DOWNSTREAM-2023-01-26-20-15-10
      Hub OCP 4.12.1
      SNO OCP 4.11.24 upgrading to 4.12.1

      How reproducible:

      Depends on whether the number of clusters per CGU is large enough, on when the CGUs are enabled for upgrade, and on whether backup is enabled.

      Steps to Reproduce:


      Actual results:


      Expected results:


      Additional info:


        1. 0.log.20230213-054448.gz
          3.08 MB
        2. 0.log.20230213-080233.gz
          2.62 MB
        3. 0.log.20230213-102020.gz
          2.63 MB
        4. 0.log.20230213-123808.gz
          2.66 MB
        5. 0.log.gz
          2.19 MB
        6. backup.jobs
          17 kB
        7. zpu.cgus.yaml
          597 kB

            jche@redhat.com Jun Chen
            akrzos@redhat.com Alex Krzos
            Alex Krzos Alex Krzos