OCPBUGS-38291 (OpenShift Bugs)

etcd database space exceeded while initiating 500 concurrent Image based upgrades via CGU


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Undefined
    • None
    • 4.16
    • Etcd
    • None
    • False

      Description of problem:

      While scale testing Image-Based Upgrades (IBU) for Telco environments managed by ACM, the etcd database space was exceeded. This prevented applying further ClusterGroupUpgrades (CGUs) to start upgrades on more clusters until the etcd database was defragmented. The environment has 3672 clusters provisioned and managed; of those, 3639 became DU compliant and eligible for upgrade testing. These clusters were split into 8 groups (7 groups of 500 and 1 group of 139), with one CGU per group. Each group was pushed into the prep stage using a CGU, and TALM flipped the ACM Policies from inform to enforce. After the prep stage completed for every cluster, a new set of 8 CGUs was applied, one every 15 minutes, to push the ibu object on each cluster into the upgrade stage. In some instances, after one CGU was applied, the next CGU failed to apply with "etcdserver: mvcc: database space exceeded".
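The grouping arithmetic above can be checked with a short sketch. It assumes the eligible clusters are available as a simple list; the cluster names and the `split_into_groups` helper are hypothetical and are not taken from the test tooling.

```python
def split_into_groups(clusters, group_size=500):
    """Split a list of cluster names into consecutive groups of at most group_size."""
    return [clusters[i:i + group_size] for i in range(0, len(clusters), group_size)]


# Hypothetical cluster names; the report does not list the real ones.
eligible = [f"cluster-{n:05d}" for n in range(3639)]

groups = split_into_groups(eligible)
group_sizes = [len(group) for group in groups]

# One CGU would be generated per group.
print(len(groups), group_sizes)
```

With 3639 eligible clusters and a group size of 500, this yields 8 groups: 7 of 500 and a final one of 139.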
      
      An example of the scripts' output, in sequence:
      
      # ./apply-cgu-ibu-prep-4-16-3.sh
      Thu Aug  8 21:37:47 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0000 created
      Thu Aug  8 21:57:47 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0001 created
      Thu Aug  8 22:17:48 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0002 created
      Thu Aug  8 22:37:48 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0003 created
      Thu Aug  8 22:57:48 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0004 created
      Thu Aug  8 23:17:48 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0005 created
      Thu Aug  8 23:37:48 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0006 created
      Thu Aug  8 23:57:48 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0007 created
      
      # ./apply-cgu-ibu-upgrade-4-16-3.sh
      Fri Aug  9 13:23:35 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0000 created
      Fri Aug  9 13:38:36 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0001 created
      Fri Aug  9 13:53:36 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0002 created
      Fri Aug  9 14:08:36 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0003 created
      Fri Aug  9 14:23:36 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0004 created
      Fri Aug  9 14:38:36 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0005 created
      Fri Aug  9 14:53:37 UTC 2024
      Error from server: error when creating "/root/rhacm-ztp/ibu/scripts/cgu-ibu-upgrade-4-16-3-0006.yml": etcdserver: mvcc: database space exceeded
      Fri Aug  9 15:08:37 UTC 2024
      clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0007 created
      
      
      After seeing the etcdserver error while observing the upgrades, I manually defragmented etcd and applied the CGU that was missing. The script could continue only because I performed this remediation to keep the test running.
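The manual remediation described above could be folded into the apply loop. The following is a minimal sketch only, not the actual apply-cgu scripts: `apply_cgus`, `oc_apply`, and the `remediate` hook are hypothetical names, and the hook stands in for whatever defragmentation procedure the operator uses. The 900-second default mirrors the 15-minute interval between upgrade CGUs.

```python
import subprocess
import time

SPACE_EXCEEDED = "etcdserver: mvcc: database space exceeded"


def oc_apply(path):
    """Run `oc apply -f <path>` and return (return code, combined output)."""
    proc = subprocess.run(["oc", "apply", "-f", path], capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def apply_cgus(paths, interval_s=900, apply=oc_apply, remediate=None,
               sleep=time.sleep, max_retries=1):
    """Apply each CGU file in order, waiting interval_s seconds between files.

    If an apply fails with the etcd space-exceeded error and a remediate
    callback is given, call it and retry up to max_retries times.
    Returns the list of paths that could not be applied.
    """
    failed = []
    for index, path in enumerate(paths):
        retries = 0
        while True:
            code, output = apply(path)
            if code == 0:
                break
            if SPACE_EXCEEDED in output and remediate is not None and retries < max_retries:
                remediate()  # hypothetical hook, e.g. the operator's defrag procedure
                retries += 1
                continue
            failed.append(path)
            break
        if index < len(paths) - 1:
            sleep(interval_s)
    return failed
```

Passing `apply` and `sleep` as parameters keeps the loop testable without a cluster; any other apply error is recorded and the loop moves on, which matches the way the original script continued to the next CGU after the failure.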
      
      This issue was not hit during every upgrade, and never during the first upgrade performed in the scale tests. Sometimes it was hit a second time shortly afterward, during a rollback or finalize stage.
      
      

       

      Version-Release number of selected component (if applicable):

      Hub OCP - 4.16.3
      Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
      ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
      TALM - 4.16.0


              dwest@redhat.com Dean West
              akrzos@redhat.com Alex Krzos
              Ge Liu