Story
Resolution: Unresolved
Major
Future Sustainability
We are struggling with failure patterns where 20-60 tests fail in a set of job runs.
Examples:
- an upgrade failure can lead to 20+ failed tests: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-aws-ovn-upgrade/1997663581734178816
- an install failure can lead to 7+ failed tests: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-vsphere-static-ovn/1998074217861484544
- some bugs can cause mass failures, e.g. OCPBUGS-66420: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips/1998279551385341952
Many problems hit multiple variant combos at once, meaning you're quickly sorting through hundreds of component readiness regressions.
Would it be feasible to designate one test which, if it fails, overrides all other failures in that run?
For example, if "upgrade: [sig-cluster-lifecycle] Cluster completes upgrade" fails, that is the only test component readiness will show a regression for; all other test failures in those runs do not trigger regressions.
Alternatively, assume a "should not have mass e2e test failures" monitortest: if it fails, none of the other regressions count and only this one test goes regressed.
This would reduce the granularity of regressing the right component, but accurate attribution is virtually unheard of in these situations anyway, other than perhaps the right operator sometimes showing regressed for an install failure (along with several other components). The overhead of getting the resulting bug to the right team would likely be less than the overhead of what we're sorting through today.
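A minimal sketch of what the override logic might look like, in Go. The function and variable names here are hypothetical, and treating the upgrade test as an override is the proposal above, not existing component readiness behavior:

```go
package main

import "fmt"

// overrideTests are tests whose failure is treated as the root cause of a
// mass-failure run; all other failures in the same run would be suppressed.
// Membership in this list is an assumption for illustration.
var overrideTests = map[string]bool{
	"upgrade: [sig-cluster-lifecycle] Cluster completes upgrade": true,
}

// suppressMassFailures returns the failed tests that should still count
// toward regressions. If any override test failed in the run, only the
// override tests are reported and the rest are dropped. A "should not have
// mass e2e test failures" monitortest could slot in the same way: its
// failure would suppress everything else in the run.
func suppressMassFailures(failedTests []string) []string {
	var overrides []string
	for _, t := range failedTests {
		if overrideTests[t] {
			overrides = append(overrides, t)
		}
	}
	if len(overrides) > 0 {
		return overrides
	}
	return failedTests
}

func main() {
	run := []string{
		"upgrade: [sig-cluster-lifecycle] Cluster completes upgrade",
		"[sig-network] some networking test",
		"[sig-storage] some storage test",
	}
	// Only the upgrade test counts; the other two are suppressed.
	fmt.Println(suppressMassFailures(run))
}
```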
Additional thoughts:
- if we could obtain the count of failed tests per job run and display it on test details reports with a different shade of red, that would be helpful (a rough shading sketch follows this list)
- if the regression tracker stored the set of job runs observed for a regression, we could enhance the tooling that tries to tie regressions to an existing triage record. The current mechanism is helpful at times, but often difficult to trust without deep inspection. Matching on actual job runs would add a layer of confidence (see the overlap sketch below).
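For the shading idea, something as simple as bucketing the failure count into color intensities would probably do. The buckets and hex values below are illustrative assumptions, not anything that exists in the current reports:

```go
package main

import "fmt"

// shadeForFailureCount maps a job run's failed-test count to a background
// color for test details reports. Bucket boundaries and colors are
// assumptions chosen for illustration.
func shadeForFailureCount(failedCount int) string {
	switch {
	case failedCount >= 20:
		return "#7f0000" // mass failure: darkest red
	case failedCount >= 5:
		return "#cc3333" // likely collateral damage
	case failedCount >= 1:
		return "#ff9999" // isolated failure: light red
	default:
		return "#ffffff" // no failures
	}
}

func main() {
	for _, n := range []int{0, 2, 12, 45} {
		fmt.Printf("%d failed tests -> %s\n", n, shadeForFailureCount(n))
	}
}
```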
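And a sketch of how matching on stored job runs could score a regression against a triage record. The types are hypothetical, since persisting observed job runs is exactly the enhancement being proposed; the overlap metric is one plausible choice:

```go
package main

import "fmt"

// Regression is a hypothetical record that, per the proposal, would persist
// the job runs the regression was observed in.
type Regression struct {
	TestID  string
	JobRuns map[string]bool // job run IDs the regression was observed in
}

// TriageRecord is a hypothetical shape for an existing triage entry.
type TriageRecord struct {
	BugURL  string
	JobRuns map[string]bool // job runs already triaged to this bug
}

// jobRunOverlap returns the fraction of the regression's job runs that the
// triage record also covers. A high overlap ties the regression to the bug
// with more confidence than matching on test name or time window alone.
func jobRunOverlap(r Regression, t TriageRecord) float64 {
	if len(r.JobRuns) == 0 {
		return 0
	}
	shared := 0
	for run := range r.JobRuns {
		if t.JobRuns[run] {
			shared++
		}
	}
	return float64(shared) / float64(len(r.JobRuns))
}

func main() {
	reg := Regression{
		TestID:  "some-test",
		JobRuns: map[string]bool{"run-1": true, "run-2": true},
	}
	tri := TriageRecord{
		BugURL:  "https://example.invalid/OCPBUGS-66420", // placeholder URL
		JobRuns: map[string]bool{"run-1": true},
	}
	fmt.Printf("overlap: %.2f\n", jobRunOverlap(reg, tri)) // 0.50
}
```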