[OCPBUGS-43565] etcd platform pod exist test failing on etcd-scaling jobs

Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.18.0
Component/s: Etcd
Labels: None

Severity: Important
Regression: None
Blocked: False
Blocked Reason: None

[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times

This new test appears to be a problem on etcd-scaling jobs where the exits are presently expected.

Example failure: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-ovn-etcd-scaling/1846579352007872512

An exception needs to be added. However, we currently have no mechanism for adding an exception within a specific job; all we have to go on is the job name, which is an imperfect way to disable these tests.

We don't want to disable the whole monitortest as that would shut down checks for pod exits on all the other control plane pods in the etcd-scaling job.

https://redhat-internal.slack.com/archives/C027U68LP/p1729174182218029 has details on why the test is presently expected to fail and some thoughts around how this could be solved:

via deads:
1. finding a way to avoid restarting containers is best
2. finding a way to make the exit more graceful is next best
3. skipping only the exact pod pattern on the exact test is next best (not on all jobs, only the scaling ones)

skipping the monitor test is not viable.

An option for 3 could be an env var that is set only in the etcd-scaling job configuration and that the test looks for: a list of namespaces to skip the check on, comma-separated or similar.
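A minimal sketch of how the test side of that env var idea might look, in Go. The variable name `POD_EXIT_CHECK_SKIP_NAMESPACES` and both helper functions are hypothetical, invented here for illustration; nothing in this issue defines them, and this is not the existing monitortest code.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// skipNamespacesEnv is a hypothetical variable name (an assumption for this
// sketch). It would be set only in the etcd-scaling job configuration.
const skipNamespacesEnv = "POD_EXIT_CHECK_SKIP_NAMESPACES"

// parseSkipNamespaces splits a comma-separated list into a set,
// trimming whitespace and dropping empty entries.
func parseSkipNamespaces(raw string) map[string]bool {
	skip := map[string]bool{}
	for _, ns := range strings.Split(raw, ",") {
		ns = strings.TrimSpace(ns)
		if ns != "" {
			skip[ns] = true
		}
	}
	return skip
}

// shouldCheckNamespace reports whether the excessive-exit check
// should still run for the given namespace.
func shouldCheckNamespace(ns string, skip map[string]bool) bool {
	return !skip[ns]
}

func main() {
	// With the variable unset, the skip set is empty and every namespace is checked.
	skip := parseSkipNamespaces(os.Getenv(skipNamespacesEnv))
	for _, ns := range []string{"openshift-etcd", "openshift-kube-apiserver"} {
		fmt.Printf("%s checked=%t\n", ns, shouldCheckNamespace(ns, skip))
	}
}
```

Because the default is an empty skip set, jobs that do not set the variable keep the check on every namespace. Note that a namespace-level skip is coarser than the "exact pod pattern" suggested in option 3; a pod-name pattern could be matched the same way if that precision is needed.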

is related to

OCPBUGS-43379 etcd-scaling jobs failing ~60% of the time

Forrest Babcock added a comment - 2024/11/04 4:15 PM

Moved to etcd component to review potential solutions to the failure listed above:

1. finding a way to avoid restarting containers is best
2. finding a way to make the exit more graceful is next best
3. skipping only the exact pod pattern on the exact test is next best (not on all jobs, only the scaling ones)

Assignee: Dean West
Reporter: Devan Goodwin
Votes: 0
Watchers: 2
Created: 2024/10/18 5:07 PM
Updated: 2024/11/26 3:59 PM
