[OCPBUGS-38015] [4.16] etcd vertical scaling test should not rely on CPMS status.readyReplicas

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.z
Affects Version/s: 4.14, 4.15, 4.16, 4.17
Component/s: Etcd
Labels:
None

Severity:
Important
Regression:
None
Story Points:
1
Sprint:
ETCD Sprint 257
sprint_count:
1
Release Blocker:
Proposed
Blocked:
False
Blocked Reason:
None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
Done
Target Version:

4.16.z
Target Backport Versions:

4.14, 4.15, 4.16


This is a clone of issue ~~OCPBUGS-37837~~. The following is the description of the original issue:
—
In our vertical scaling test, after we delete a machine, we rely on the `status.readyReplicas` field of the ControlPlaneMachineSet (CPMS) to indicate that it has successfully created a new machine, which lets us scale up before we scale down.
https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L76-L87

As we've seen in the past, that status field isn't a reliable indicator of machine scale-up: `status.readyReplicas` might stay at 3 because the soon-to-be-removed node that is pending deletion can go Ready=Unknown, as in runs such as the following: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1286/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling/1808186565449486336

The test then times out waiting for status.readyReplicas=4, even though the scale-up and scale-down may already have happened.
This shows up across scaling tests on all platforms as:

fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:81]: Unexpected error:
    <*errors.withStack | 0xc002182a50>: 
    scale-up: timed out waiting for CPMS to show 4 ready replicas: timed out waiting for the condition
    {
        error: <*errors.withMessage | 0xc00304c3a0>{
            cause: <wait.errInterrupted>{
                cause: <*errors.errorString | 0xc0003ca800>{
                    s: "timed out waiting for the condition",
                },
            },
            msg: "scale-up: timed out waiting for CPMS to show 4 ready replicas",
        },

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling/1811686448848441344

https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522etcd-scaling%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=net_improvement

In hindsight, all we care about is whether the deleted machine's member is replaced by another machine's member; we can ignore the flapping of node and machine statuses while we wait for the scale-up and then scale-down of members to happen. So we can relax or replace the check on status.readyReplicas and just look at the membership change.
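
A rough sketch of what "just looking at the membership change" could mean, for illustration only. This is not the code from openshift/origin: `memberLister`, `memberReplaced`, and `waitForMemberReplacement` are hypothetical names, member names are plain strings, and the size check assumes the membership returns to its original count once the scale-down has finished.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// memberLister is a hypothetical hook that returns the current etcd member
// names. A real test would back it with an etcd client; that is assumed here.
type memberLister func(ctx context.Context) ([]string, error)

// memberReplaced reports whether the deleted member is gone, a member that was
// not in the initial set has joined, and the member count is back to its
// initial size (assumption: scale-up followed by scale-down has completed).
func memberReplaced(initial, current []string, deleted string) bool {
	before := make(map[string]struct{}, len(initial))
	for _, m := range initial {
		before[m] = struct{}{}
	}
	sawNew := false
	for _, m := range current {
		if m == deleted {
			return false // old member still present
		}
		if _, ok := before[m]; !ok {
			sawNew = true
		}
	}
	return sawNew && len(current) == len(initial)
}

// waitForMemberReplacement polls the lister until the membership change is
// observed or the context expires. Listing errors are treated as transient.
func waitForMemberReplacement(ctx context.Context, list memberLister, initial []string, deleted string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		current, err := list(ctx)
		if err == nil && memberReplaced(initial, current, deleted) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for member %q to be replaced: %w", deleted, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	// Simulated membership snapshots: steady state, scale-up, then scale-down.
	initial := []string{"master-0", "master-1", "master-2"}
	snapshots := [][]string{
		{"master-0", "master-1", "master-2"},
		{"master-0", "master-1", "master-2", "master-3"},
		{"master-0", "master-1", "master-3"},
	}
	i := 0
	list := func(ctx context.Context) ([]string, error) {
		s := snapshots[i]
		if i < len(snapshots)-1 {
			i++
		}
		return s, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForMemberReplacement(ctx, list, initial, "master-2", 10*time.Millisecond)
	fmt.Println("replaced:", err == nil)
}
```

The check never consults node or machine readiness, so a node flapping to Ready=Unknown while it is pending deletion would not affect the outcome.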

PS: We can also update the outdated Godoc comments for the test to mention that it relies on the CPMSO to create a machine for us: https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L34-L38

blocks

OCPBUGS-38086 [4.15] etcd vertical scaling test should not rely on CPMS status.readyReplicas

Closed

clones

OCPBUGS-37837 etcd vertical scaling test should not rely on CPMS status.readyReplicas

Closed

is blocked by

OCPBUGS-37837 etcd vertical scaling test should not rely on CPMS status.readyReplicas

Closed

is cloned by

OCPBUGS-38086 [4.15] etcd vertical scaling test should not rely on CPMS status.readyReplicas

Closed

links to

openshift/origin#28981: [release-4.16] OCPBUGS-38015: vertical scaling test should not rely on CPMS replicas

RHBA-2024:5422 OpenShift Container Platform 4.16.z bug fix update


Errata Tool added a comment - 2024/08/20 3:22 PM

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory (Important: OpenShift Container Platform 4.16.8 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:5422

Thomas Jungblut added a comment - 2024/08/12 8:24 AM - edited

last 4.16 periodic was healthy again, setting to verified
