-
Bug
-
Resolution: Not a Bug
-
Undefined
-
None
-
4.16
-
None
-
False
-
-
Description of problem:
While scale testing Image-Based Upgrades (IBU) for Telco environments managed by ACM, the etcd database space was exceeded. This prevented further ClusterGroupUpgrades (CGUs) from being applied to start upgrades on more clusters until the etcd database was defragmented.

The environment has 3672 clusters provisioned and managed. Of those, 3639 became DU compliant and eligible for upgrade testing. The clusters were split into 7 groups of 500 and 1 group of 139 (8 CGUs). Each group was pushed into the prep stage using a CGU, and TALM flipped the ACM Policies from inform to enforce. After the prep stage completed for every cluster, a new set of 8 CGUs was applied, one every 15 minutes, to push the ibu object of each cluster into the upgrade stage. In some instances, after one CGU was applied, the next CGU failed to apply with "etcdserver: mvcc: database space exceeded".

An example of the scripts' output, in sequence:

    # ./apply-cgu-ibu-prep-4-16-3.sh
    Thu Aug 8 21:37:47 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0000 created
    Thu Aug 8 21:57:47 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0001 created
    Thu Aug 8 22:17:48 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0002 created
    Thu Aug 8 22:37:48 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0003 created
    Thu Aug 8 22:57:48 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0004 created
    Thu Aug 8 23:17:48 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0005 created
    Thu Aug 8 23:37:48 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0006 created
    Thu Aug 8 23:57:48 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0007 created

    # ./apply-cgu-ibu-upgrade-4-16-3.sh
    Fri Aug 9 13:23:35 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0000 created
    Fri Aug 9 13:38:36 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0001 created
    Fri Aug 9 13:53:36 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0002 created
    Fri Aug 9 14:08:36 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0003 created
    Fri Aug 9 14:23:36 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0004 created
    Fri Aug 9 14:38:36 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0005 created
    Fri Aug 9 14:53:37 UTC 2024
    Error from server: error when creating "/root/rhacm-ztp/ibu/scripts/cgu-ibu-upgrade-4-16-3-0006.yml": etcdserver: mvcc: database space exceeded
    Fri Aug 9 15:08:37 UTC 2024
    clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0007 created

When I saw the etcdserver error while observing the upgrades, I manually defragmented etcd and applied the CGU that had failed. The script was able to continue only because I performed that remediation to keep the test running.

This issue was not hit during every upgrade, and never on the first upgrade performed during the scale tests. Sometimes it was hit a second time shortly afterwards, during a rollback or finalize stage.
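
For anyone reproducing this, the grouping arithmetic and the detection of failed CGU applies can be sketched in a few lines of Python. This is only an illustrative sketch, not the scripts used in the test: the function names and the generated cluster names are hypothetical, and the regular expression assumes the error line has exactly the format shown in the output above.

```python
import re


def split_into_groups(clusters, group_size=500):
    """Split a list of cluster names into consecutive groups of at most group_size."""
    return [clusters[i:i + group_size] for i in range(0, len(clusters), group_size)]


# Assumption: the error line looks like the one quoted in this report.
FAILED_APPLY = re.compile(
    r'error when creating "(?P<path>[^"]+)": '
    r"etcdserver: mvcc: database space exceeded"
)


def failed_cgu_files(script_output):
    """Return the manifest paths whose creation failed with the etcd space error."""
    return [m.group("path") for m in FAILED_APPLY.finditer(script_output)]


if __name__ == "__main__":
    # Hypothetical cluster names; only the count (3639) comes from the report.
    clusters = [f"cluster-{n:05d}" for n in range(3639)]
    groups = split_into_groups(clusters)
    print(len(groups), [len(g) for g in groups])

    sample_output = (
        "Fri Aug 9 14:38:36 UTC 2024\n"
        "clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0005 created\n"
        "Fri Aug 9 14:53:37 UTC 2024\n"
        'Error from server: error when creating '
        '"/root/rhacm-ztp/ibu/scripts/cgu-ibu-upgrade-4-16-3-0006.yml": '
        "etcdserver: mvcc: database space exceeded\n"
    )
    # Lists the manifests that would need to be re-applied after defragmenting etcd.
    print(failed_cgu_files(sample_output))
```

Scanning the output this way only identifies which CGU manifests were skipped; it does not defragment etcd or re-apply anything, which was done manually in the test.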
Version-Release number of selected component (if applicable):
Hub OCP - 4.16.3
Spoke Clusters - Originally deployed 4.14.31, then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
TALM - 4.16.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info: