OpenShift Bugs / OCPBUGS-43379

etcd-scaling jobs failing ~60% of the time

Type: Bug
Resolution: Unresolved
Affects Version: 4.18.0
Component: Etcd
Priority: Critical
Sprint: ETCD Sprint 263
      While designing a solution to have these rarely run jobs included in component readiness, I discovered the etcd-scaling job has been quite broken for some time. The problem appears to be invariant tests flagging "unexpected" things happening in the cluster.

      It's possible some or all of these boil down to "this is expected during an etcd scaling operation" if a strong case can be made.
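If the eventual answer is "this is expected during scaling," the exception logic might look roughly like the sketch below: only flag Available=False transitions that fall outside the scaling window. All names, timestamps, and the window itself are hypothetical; the real origin invariant tests operate on monitor intervals, not this simplified model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ConditionChange:
    operator: str
    condition: str
    status: str
    at: datetime

def unexpected_changes(changes, scaling_start, scaling_end):
    """Keep Available=False transitions that fall outside the etcd scaling window."""
    return [
        c for c in changes
        if c.condition == "Available" and c.status == "False"
        and not (scaling_start <= c.at <= scaling_end)
    ]

# Hypothetical data: one flap during scaling, one well after it.
start = datetime(2024, 10, 9, 12, 0)
end = start + timedelta(minutes=15)
events = [
    ConditionChange("kube-storage-version-migrator", "Available", "False",
                    start + timedelta(minutes=5)),
    ConditionChange("operator-lifecycle-manager-packageserver", "Available", "False",
                    end + timedelta(minutes=30)),
]
flagged = unexpected_changes(events, start, end)
print([c.operator for c in flagged])  # only the out-of-window flap is reported
```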

      [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
      

      This one seems very common, examples:
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling/1844042416286339072
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling/1841505588492636160
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling/1831358169155112960

      [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
      

      Examples:

      [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
      

      Examples:

      [sig-node] node-lifecycle detects unexpected not ready node
      [sig-node] node-lifecycle detects unreachable state on node
      

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling/1828821429399851008

      It's likely more examples could be found here.

      There is a lot to unravel here, but is it acceptable for operators (seemingly several of them) to go Available=False (a serious condition that would often result in someone getting alerted) during an etcd scaling operation?
      The same question applies to unreachable nodes and etcd member-down alerts.
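One way to frame "acceptable" is by duration: a brief Available=False blip during a scaling operation is different from a sustained outage. A minimal sketch, with an assumed grace period and made-up intervals (the tolerance is not an origin constant):

```python
from datetime import datetime, timedelta

# Hypothetical Available=False intervals (start, end) observed for one operator.
intervals = [
    (datetime(2024, 10, 9, 12, 4), datetime(2024, 10, 9, 12, 6)),   # 2m blip during scaling
    (datetime(2024, 10, 9, 13, 0), datetime(2024, 10, 9, 13, 20)),  # 20m sustained outage
]

TOLERANCE = timedelta(minutes=5)  # assumed grace period, chosen for illustration

def serious(ivs, tolerance=TOLERANCE):
    """Keep only Available=False intervals that outlast the grace period."""
    return [(s, e) for s, e in ivs if e - s > tolerance]

print(serious(intervals))  # only the 20-minute outage survives the filter
```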

              Jubitta John (rh-ee-jujohn)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Ge Liu