OpenShift Bugs / OCPBUGS-36462

control-plane-machine-set goes Available=False with UnavailableReplicas during etcd scale testing


    • Moderate
    • None
    • 3
    • ETCD Sprint 257
    • 1
    • Rejected
    • False
    • Previously, the health checks for the etcd Operator were not ordered. As a consequence, the health check sometimes failed even though all etcd members were healthy. The health-check failure triggered a scale-down event that caused the Operator to prematurely remove a healthy member. With this release, the health checks in the Operator are ordered. As a result, the health checks correctly reflect the health of etcd members and an incorrect scale-down event does not occur. (link:https://issues.redhat.com/browse/OCPBUGS-36462[*OCPBUGS-36462*])
    • Bug Fix
    • Done

      Description of problem

      Similar to OCPBUGS-20061, but for a different situation:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&name=pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep 'failures match' | sort
      pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling (all) - 15 runs, 60% failed, 33% of failures match = 20% impact
      

      In that test, since ETCD-329, the test suite deletes a control-plane Machine and waits for the ControlPlaneMachineSet controller to scale in a replacement. But in runs like this, the outgoing Node goes Ready=Unknown for reasons that have not yet been diagnosed, that somehow slips past the inertia added in cpmso#294 (maybe the Running guard should be dropped?), and the ClusterOperator goes Available=False, complaining about Missing 1 available replica(s).
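      For anyone poking at a live run, a few read-only oc queries along these lines show the pieces involved (a rough sketch; the label selectors and jsonpath formatting are illustrative, not copied from the test suite):

      $ oc get clusteroperator control-plane-machine-set -o jsonpath='{.status.conditions[?(@.type=="Available")]}{"\n"}'
      $ oc -n openshift-machine-api get controlplanemachineset cluster
      $ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master
      $ oc get nodes -l node-role.kubernetes.io/master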

      It's not clear from the message which replica the operator is worried about (that would be helpful information to include in the message), but I suspect it's the Machine/Node that is in the process of being deleted. Regardless of which replica it is, this does not seem like a situation worth a cluster-admin-midnight-page Available=False alarm.
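      Even without that detail in the message, the ControlPlaneMachineSet status and the per-Machine phases narrow down which replica is being counted as unavailable; something like the following should surface it (a sketch, assuming the usual CPMS status fields and Machine status layout):

      $ oc -n openshift-machine-api get controlplanemachineset cluster -o jsonpath='{.status.replicas} {.status.readyReplicas} {.status.unavailableReplicas}{"\n"}'
      $ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name,DELETING:.metadata.deletionTimestamp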

      Version-Release number of selected component

      Seen in dev-branch CI. I haven't gone back to check older 4.y.

      How reproducible

      CI Search shows a 20% impact; see the query earlier in this report.

      Steps to Reproduce

      Run a batch of pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling jobs and check the CI Search results.
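      Outside of CI, a rough manual approximation (a sketch, not the exact ETCD-329 test procedure, and only suitable for a throwaway test cluster since it deletes a control-plane Machine) is to delete one control-plane Machine and watch whether the ClusterOperator dips to Available=False while the ControlPlaneMachineSet controller provisions the replacement:

      $ MACHINE=$(oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master -o jsonpath='{.items[0].metadata.name}')
      $ oc -n openshift-machine-api delete machine "$MACHINE" --wait=false
      $ oc get clusteroperator control-plane-machine-set -w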

      Actual results

      20% impact

      Expected results

      No hits.

              Haseeb Tariq (rhn-coreos-htariq)
              W. Trevor King (trking)
              Ge Liu
              Laura Hinson
              Votes: 0
              Watchers: 10
