-
Epic
-
Resolution: Unresolved
-
Interop Layered Products in Component Readiness
-
Future Sustainability
-
100% To Do, 0% In Progress, 0% Done
Met with rh-ee-mpruitt yesterday. Their primary goal is to reduce the manual toil in processing job results. They run roughly 40 jobs once a week. A tool called firewatch automatically categorizes each result as either an infra failure or a test failure that needs investigation, based on which CI registry step failed. Jira tickets are automatically created and routed either to the layered product team or to Michael's team (MPIIT). The manual toil of sorting through the failures in each Jira ticket for their team (5-10 per day) is the issue. He believes there is a retry mechanism built into job runs that in some cases will just try again, though this reportedly seldom works.
We discussed onboarding them into Component Readiness to help determine when something is really wrong and to look past intermittent/infrastructure failures, as well as to prevent OpenShift from shipping if something is broken in a statistically significant way.
This would take time and proven stability before we could bring them into the main Component Readiness view. In the meantime, it would be relatively easy to add a custom view for the layered products here, to explore what it looks like today if we hunt for regressions.
Given their run rate, these jobs may have to qualify for the raw pass rate comparisons used for rarely run jobs. We should discuss with Michael whether it's possible to get budget to run them all 7-10 times a week; this would give us much stronger regression detection.
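To illustrate why run frequency matters, here is a minimal sketch that assumes a one-sided Fisher's exact test is a reasonable stand-in for the statistical comparison. The function name and the pass/fail counts are hypothetical and are not taken from these jobs or from the Component Readiness code.

```python
from math import comb


def fisher_exact_less(sample_pass, sample_fail, basis_pass, basis_fail):
    """One-sided Fisher's exact test p-value.

    Returns the probability of seeing this few (or fewer) passes in the
    sample if the sample and the basis actually share one pass rate.
    All counts here are hypothetical inputs for illustration.
    """
    n_sample = sample_pass + sample_fail
    total_pass = sample_pass + basis_pass
    total_fail = sample_fail + basis_fail
    denom = comb(total_pass + total_fail, n_sample)
    p = 0.0
    # Sum hypergeometric probabilities for every outcome at least as bad
    # as the observed one. comb() returns 0 for impossible splits.
    for k in range(0, sample_pass + 1):
        p += comb(total_pass, k) * comb(total_fail, n_sample - k) / denom
    return p


def is_regression(sample_pass, sample_fail, basis_pass, basis_fail, alpha=0.05):
    """Flag a regression only when the drop is statistically significant."""
    return fisher_exact_less(sample_pass, sample_fail, basis_pass, basis_fail) < alpha
```

With one run a week over four weeks, 2 passes out of 4 against a basis of 4 out of 4 gives a p-value of about 0.21, so the same 50% pass rate is not flagged. At ten runs a week, 20 passes out of 40 against 40 out of 40 is flagged with a p-value far below 0.05.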
We briefly discussed the idea of automatically retrying rarely run jobs when they fail by triggering more runs with gangway. Michael's team was interested in whether there is a way to determine if a re-run is likely to help. From TRT's point of view there's no harm in generating signal: it will either confirm there's a problem or help confirm there isn't.
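The retry idea could be sketched roughly as follows. This is not a gangway client: `run_job` is a hypothetical callable standing in for "trigger one more run and wait for its result", and the verdict labels and the default of three extra runs are assumptions for illustration only.

```python
from typing import Callable, Dict


def confirm_failure(job: str, run_job: Callable[[str], bool], extra_runs: int = 3) -> Dict:
    """After an initial failure, trigger extra runs and summarize the signal.

    run_job is a hypothetical hook: it should trigger one run of the job
    and return True if that run passed, False otherwise.
    """
    results = [run_job(job) for _ in range(extra_runs)]
    passes = sum(results)
    if passes == extra_runs:
        verdict = "likely intermittent"   # every re-run passed
    elif passes == 0:
        verdict = "likely real"           # every re-run failed too
    else:
        verdict = "needs investigation"   # mixed signal
    return {"job": job, "extra_runs": extra_runs, "passes": passes, "verdict": verdict}
```

Either outcome is useful: consistent failures confirm a problem worth a ticket, while consistent passes suggest the original failure was noise.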
Automated tickets that have been filed as a result of the output: https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=20602&view=detail&selectedIssue=SRVKP-6653#