Type: Bug
Resolution: Not a Bug
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.16
Component/s: Etcd
Labels:
- perfscale-telco-5g
- telco-5g

Regression:
None
Blocked:
False
Blocked Reason:
None
RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

While scale testing Image-Based Upgrades (IBU) for Telco environments managed by ACM, the etcd database space was exceeded. This prevented further ClusterGroupUpgrades (CGUs) from being applied to start upgrades on more clusters until the etcd database was defragmented.

The environment has 3672 clusters provisioned and managed; 3639 of those became DU compliant and eligible for upgrade testing. The clusters were split into 8 groups, one CGU per group: 7 groups of 500 and 1 group of 139. Each group was pushed into the prep stage using a CGU, and TALM flipped the ACM Policies from inform to enforce. After the prep stage completed for every cluster, a new set of 8 CGUs was applied, one every 15 minutes, to push the ibu object per cluster into the upgrade stage. In some instances, after one CGU was applied, the next CGU failed to apply with "etcdserver: mvcc: database space exceeded".

An example of the scripts' output, in sequence:

# ./apply-cgu-ibu-prep-4-16-3.sh
Thu Aug  8 21:37:47 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0000 created
Thu Aug  8 21:57:47 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0001 created
Thu Aug  8 22:17:48 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0002 created
Thu Aug  8 22:37:48 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0003 created
Thu Aug  8 22:57:48 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0004 created
Thu Aug  8 23:17:48 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0005 created
Thu Aug  8 23:37:48 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0006 created
Thu Aug  8 23:57:48 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-prep-4-16-3-0007 created

# ./apply-cgu-ibu-upgrade-4-16-3.sh
Fri Aug  9 13:23:35 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0000 created
Fri Aug  9 13:38:36 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0001 created
Fri Aug  9 13:53:36 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0002 created
Fri Aug  9 14:08:36 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0003 created
Fri Aug  9 14:23:36 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0004 created
Fri Aug  9 14:38:36 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0005 created
Fri Aug  9 14:53:37 UTC 2024
Error from server: error when creating "/root/rhacm-ztp/ibu/scripts/cgu-ibu-upgrade-4-16-3-0006.yml": etcdserver: mvcc: database space exceeded
Fri Aug  9 15:08:37 UTC 2024
clustergroupupgrade.ran.openshift.io/ibu-upgrade-4-16-3-0007 created
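
For readers without access to the scripts, the pacing behaviour visible in the output above can be sketched roughly as follows. This is a minimal sketch and not the actual apply-cgu-ibu-upgrade-4-16-3.sh, which is not attached to this issue. The function name `apply_cgus` and the variables `OC`, `INTERVAL`, and `FAILED_CGUS` are assumptions made for illustration. It prints a timestamp, applies one CGU manifest, keeps going if an apply is rejected, and records the rejected files so they can be re-applied after remediation.

```shell
# Hypothetical sketch of a paced CGU apply loop (not the reporter's script).
# OC and INTERVAL are overridable so the loop can be dry-run without a cluster.
OC="${OC:-oc}"
INTERVAL="${INTERVAL:-900}"   # seconds between CGUs; 900 = 15 minutes
FAILED_CGUS=""                # space-separated list of manifests that failed

apply_cgus() {
  local file first=1
  FAILED_CGUS=""
  for file in "$@"; do
    # Wait between CGUs, but not before the first one.
    if [ "$first" -eq 0 ]; then
      sleep "$INTERVAL"
    fi
    first=0
    date -u
    # Continue on failure so one rejected CGU does not stall the others,
    # and remember the file so it can be retried later.
    if ! "$OC" apply -f "$file"; then
      FAILED_CGUS="${FAILED_CGUS:+$FAILED_CGUS }$file"
    fi
  done
  # Non-zero exit status if any manifest failed to apply.
  [ -z "$FAILED_CGUS" ]
}
```

A call such as `apply_cgus cgu-ibu-upgrade-4-16-3-000*.yml` would then leave any rejected manifests in `FAILED_CGUS`, for example the 0006 file in the run above.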


When I saw the etcdserver error while observing the upgrades, I manually defragmented etcd and applied the CGU that was missing. The script could continue because I performed that remediation in order to keep the test running.
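
The exact commands used for the manual defragmentation are not recorded in this issue. A minimal sketch of one way to do it, assuming the etcd pods run in the `openshift-etcd` namespace with the label `k8s-app=etcd` and expose a container named `etcdctl`, might look like the following. The function names `etcd_pods`, `etcdctl_in`, and `defrag_all` are hypothetical. The `alarm disarm` step assumes a NOSPACE alarm was raised when the quota was hit.

```shell
# Hypothetical sketch of a manual etcd defragmentation (not the reporter's commands).
# OC is overridable so the sequence can be dry-run without a cluster.
OC="${OC:-oc}"
NS="openshift-etcd"   # assumed namespace of the etcd pods

# List etcd pods as pod/<name>; the label selector is an assumption.
etcd_pods() {
  "$OC" -n "$NS" get pods -l k8s-app=etcd -o name
}

# Run etcdctl against the local member inside one etcd pod.
etcdctl_in() {
  local target=$1
  shift
  "$OC" -n "$NS" exec -c etcdctl "$target" -- sh -c \
    'unset ETCDCTL_ENDPOINTS; exec etcdctl --command-timeout=30s --endpoints=https://localhost:2379 "$@"' \
    sh "$@"
}

# Defragment each member in turn, then clear any alarm once.
defrag_all() {
  local pods pod
  pods=$(etcd_pods) || return 1
  for pod in $pods; do
    echo "defragmenting $pod"
    etcdctl_in "$pod" defrag || return 1
  done
  # Use the first pod to disarm the alarm.
  set -- $pods
  etcdctl_in "$1" alarm disarm
}
```

This sketch defragments members in the order the pods are listed. On a production cluster it may be preferable to check `etcdctl endpoint status` first and defragment the leader last.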

This issue was not hit during every upgrade, and never on the first upgrade performed during the scale tests. Sometimes it was also hit a second time shortly afterwards, during a rollback or finalize stage.

Version-Release number of selected component (if applicable):

Hub OCP - 4.16.3
Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
TALM - 4.16.0

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

Assignee: Dean West

Reporter: Alex Krzos

QA Contact: Ge Liu

Votes: 0

Watchers: 5

Created: 2024/08/09 7:34 PM

Updated: 2024/10/21 5:04 PM

Resolved: 2024/09/02 10:31 AM
