OCP Technical Release Team / TRT-1877

Onboard Interop Layered Product Jobs in Sippy Component Readiness


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Undefined
    • Epic Name: Interop Layered Products in Component Readiness
    • Future Sustainability
    • Epic progress: 100% To Do, 0% In Progress, 0% Done

      Met with rh-ee-mpruitt yesterday; their primary goal is to reduce the manual toil in processing job results. They run roughly 40 jobs once a week, and a tool called firewatch automatically categorizes each result as either an infrastructure failure or a test failure that needs investigation, based on which CI registry step failed. Jiras are created automatically and routed either to the layered product team or to Michael's team (MPIIT). Sorting through the failures behind each of these Jiras (5-10 per day for their team) is the manual toil at issue. He believes there is a retry mechanism built into job runs that will in some cases simply try again, though this reportedly seldom helps. (A rough sketch of the step-based routing idea follows below.)
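
      The sketch below is purely illustrative of the routing idea described above; it is not firewatch's actual implementation or configuration format, and the step names and Jira project keys are hypothetical.

      # Illustrative only: a minimal sketch of step-based failure routing in the
      # spirit of what firewatch does. Step names, project keys, and the rule
      # shape here are hypothetical, not firewatch's real config or API.
      from dataclasses import dataclass

      @dataclass
      class FailedStep:
          name: str   # CI registry step that failed, e.g. "ipi-install-install"
          job: str    # full periodic job name

      # Hypothetical rule set: failures in these steps count as infrastructure
      # problems; anything else is treated as a product test failure.
      INFRA_STEPS = {"ipi-install-install", "ipi-deprovision-deprovision", "gather-must-gather"}

      def route(step: FailedStep) -> dict:
          """Decide classification and Jira routing for a failed step."""
          if step.name in INFRA_STEPS:
              return {"classification": "infra failure", "jira_project": "PIPELINE-TEAM"}
          return {"classification": "test failure", "jira_project": "LAYERED-PRODUCT-TEAM"}

      if __name__ == "__main__":
          print(route(FailedStep(name="ipi-install-install", job="periodic-...-lp-interop-...")))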

      We discussed onboarding them into Component Readiness to help determine when something is genuinely wrong while smoothing over intermittent and infrastructure failures, and to prevent OpenShift from shipping if something is broken in a statistically significant way.

      Bringing them into the main Component Readiness view would require time and proven stability, but in the meantime it would be relatively easy to add a custom view for the layered products to explore what regression hunting on these jobs looks like today.

      Given their run rate, these jobs may have to qualify for the rarely run jobs raw pass-rate comparison. We should discuss with Michael whether it's possible to get budget to run them all 7-10 times a week; this would give us much stronger regression detection (see the worked comparison sketched below).
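
      To illustrate why the run rate matters, here is a rough worked comparison. It assumes a basis period at a 100% pass rate, a sample period where the pass rate has dropped to ~50%, a four-week window, and Fisher's exact test as the comparison; the exact statistics and thresholds Component Readiness applies to rarely run jobs may differ.

      # Rough illustration of why more runs give stronger regression detection.
      # Not the exact statistics Component Readiness uses; assumptions noted above.
      from scipy.stats import fisher_exact

      def regression_p_value(base_pass, base_fail, sample_pass, sample_fail):
          """One-sided p-value that the sample period fails more than the basis."""
          table = [[sample_fail, sample_pass],
                   [base_fail, base_pass]]
          _, p = fisher_exact(table, alternative="greater")
          return p

      # ~1 run/week over 4 weeks: 4 basis runs vs 4 sample runs.
      # A drop from 100% to 50% is not statistically significant.
      print(regression_p_value(base_pass=4, base_fail=0, sample_pass=2, sample_fail=2))    # ~0.21

      # ~8 runs/week over 4 weeks: 32 basis runs vs 32 sample runs.
      # The same drop is now unambiguous.
      print(regression_p_value(base_pass=32, base_fail=0, sample_pass=16, sample_fail=16)) # ~1e-6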

      We briefly discussed an idea around automatically retrying rarely run jobs when they fail by triggering more runs with gangway (a hypothetical sketch is below). Michael's team was interested in whether there is a way to determine if a re-run is likely to help or not. From the TRT point of view there is no harm in generating more signal; it will either confirm there is a problem or help confirm there is not.
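
      The following is a hypothetical sketch of re-triggering a failed rarely run job through gangway. The endpoint URL, payload fields, token handling, and job name are all assumptions for illustration; consult the gangway API documentation for the real contract.

      # Hypothetical sketch of asking gangway to schedule another run of a
      # periodic job after a failure. Host, payload shape, and auth are assumed.
      import os
      import requests

      GANGWAY_URL = "https://gangway.example.openshift.org"   # placeholder, not the real host
      TOKEN = os.environ["GANGWAY_TOKEN"]                      # hypothetical service-account token

      def retrigger_periodic(job_name: str) -> str:
          """Request another execution of a periodic job; return whatever id gangway gives back."""
          resp = requests.post(
              f"{GANGWAY_URL}/v1/executions",
              headers={"Authorization": f"Bearer {TOKEN}"},
              json={"job_name": job_name, "job_execution_type": "PERIODIC"},
              timeout=30,
          )
          resp.raise_for_status()
          return resp.json().get("id", "")

      if __name__ == "__main__":
          # Hypothetical lp-interop periodic job name.
          print(retrigger_periodic("periodic-ci-example-lp-interop-aws"))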

      Link to view their job runs: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.18?filters=%257B%2522items%2522%253A%255B%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522lp-interop%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=net_improveme

      Automated tickets that have been filed as a result of this output: https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=20602&view=detail&selectedIssue=SRVKP-6653#

              Assignee: Unassigned
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)