OCP Technical Release Team / TRT-1877

Onboard Interop Layered Product Jobs in Sippy Component Readiness


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Undefined
    • Epic Name: Interop Layered Products in Component Readiness
    • Future Sustainability
    • Epic progress: 100% To Do, 0% In Progress, 0% Done

      Met with rh-ee-mpruitt yesterday; their primary goal is to reduce the manual toil in processing job results. They run roughly 40 jobs once a week, and a tool called firewatch automatically categorizes each result as either an infrastructure failure or a test failure that needs investigation, based on which CI registry step failed. Jiras are created automatically and routed either to the layered product team or to Michael's team (MPIIT). Sorting through the failures behind each of these Jiras (5-10 per day for their team) is the manual toil at issue. He believes there is a retry mechanism built into job runs that will in some cases simply try again, though this reportedly seldom helps. (A rough sketch of the step-based routing idea follows below.)
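
      The sketch below is purely illustrative of the routing idea described above; it is not firewatch's actual implementation or configuration format, and the step names and Jira project keys are hypothetical.

      # Illustrative only: a minimal sketch of step-based failure routing in the
      # spirit of what firewatch does. Step names, project keys, and the rule
      # shape here are hypothetical, not firewatch's real config or API.
      from dataclasses import dataclass

      @dataclass
      class FailedStep:
          name: str   # CI registry step that failed, e.g. "ipi-install-install"
          job: str    # full periodic job name

      # Hypothetical rule set: failures in these steps count as infrastructure
      # problems; anything else is treated as a product test failure.
      INFRA_STEPS = {"ipi-install-install", "ipi-deprovision-deprovision", "gather-must-gather"}

      def route(step: FailedStep) -> dict:
          """Decide classification and Jira routing for a failed step."""
          if step.name in INFRA_STEPS:
              return {"classification": "infra failure", "jira_project": "PIPELINE-TEAM"}
          return {"classification": "test failure", "jira_project": "LAYERED-PRODUCT-TEAM"}

      if __name__ == "__main__":
          print(route(FailedStep(name="ipi-install-install", job="periodic-...-lp-interop-...")))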

      We discussed onboarding them into Component Readiness to help determine when something is genuinely wrong while smoothing over intermittent and infrastructure failures, and to prevent OpenShift from shipping if something is broken in a statistically significant way.

      Bringing them into the main Component Readiness view would require time and proven stability, but in the meantime it would be relatively easy to add a custom view for the layered products to explore what regression hunting on these jobs looks like today.

      Given their run rate, these jobs may have to qualify for the rarely run jobs raw pass-rate comparison. We should discuss with Michael whether it's possible to get budget to run them all 7-10 times a week; this would give us much stronger regression detection (see the worked comparison sketched below).
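
      To illustrate why the run rate matters, here is a rough worked comparison. It assumes a basis period at a 100% pass rate, a sample period where the pass rate has dropped to ~50%, a four-week window, and Fisher's exact test as the comparison; the exact statistics and thresholds Component Readiness applies to rarely run jobs may differ.

      # Rough illustration of why more runs give stronger regression detection.
      # Not the exact statistics Component Readiness uses; assumptions noted above.
      from scipy.stats import fisher_exact

      def regression_p_value(base_pass, base_fail, sample_pass, sample_fail):
          """One-sided p-value that the sample period fails more than the basis."""
          table = [[sample_fail, sample_pass],
                   [base_fail, base_pass]]
          _, p = fisher_exact(table, alternative="greater")
          return p

      # ~1 run/week over 4 weeks: 4 basis runs vs 4 sample runs.
      # A drop from 100% to 50% is not statistically significant.
      print(regression_p_value(base_pass=4, base_fail=0, sample_pass=2, sample_fail=2))    # ~0.21

      # ~8 runs/week over 4 weeks: 32 basis runs vs 32 sample runs.
      # The same drop is now unambiguous.
      print(regression_p_value(base_pass=32, base_fail=0, sample_pass=16, sample_fail=16)) # ~1e-6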

      We briefly discussed an idea around automatically retrying rarely run jobs when they fail by triggering more runs with gangway (a hypothetical sketch is below). Michael's team was interested in whether there is a way to determine if a re-run is likely to help or not. From the TRT point of view there is no harm in generating more signal; it will either confirm there is a problem or help confirm there is not.
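
      The following is a hypothetical sketch of re-triggering a failed rarely run job through gangway. The endpoint URL, payload fields, token handling, and job name are all assumptions for illustration; consult the gangway API documentation for the real contract.

      # Hypothetical sketch of asking gangway to schedule another run of a
      # periodic job after a failure. Host, payload shape, and auth are assumed.
      import os
      import requests

      GANGWAY_URL = "https://gangway.example.openshift.org"   # placeholder, not the real host
      TOKEN = os.environ["GANGWAY_TOKEN"]                      # hypothetical service-account token

      def retrigger_periodic(job_name: str) -> str:
          """Request another execution of a periodic job; return whatever id gangway gives back."""
          resp = requests.post(
              f"{GANGWAY_URL}/v1/executions",
              headers={"Authorization": f"Bearer {TOKEN}"},
              json={"job_name": job_name, "job_execution_type": "PERIODIC"},
              timeout=30,
          )
          resp.raise_for_status()
          return resp.json().get("id", "")

      if __name__ == "__main__":
          # Hypothetical lp-interop periodic job name.
          print(retrigger_periodic("periodic-ci-example-lp-interop-aws"))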

      Link to view their job runs: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.18?filters=%257B%2522items%2522%253A%255B%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522lp-interop%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=net_improveme

      Automated tickets that have been filed as a result of this output: https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=20602&view=detail&selectedIssue=SRVKP-6653#

              Assignee: Unassigned
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)