Outcome
Resolution: Unresolved
Progress: 100% To Do, 0% In Progress, 0% Done
Outcome Overview
Konflux has removed the speed limit on product builds. AI may further reduce the time from idea to pull request. Now other bottlenecks are becoming clearer.
- Why does it take a PhD in Prowology to catch weekly OpenShift regressions?
- Why are humans needed to release code that has been proven reliable through test automation?
- Why are some fixes just sitting around waiting to be released for weeks?
- Why does getting help from SHIP almost require that I understand their team structure or, worse still, know them personally?
- Why is it so hard to know who works on what when you really do need a human?
Success Criteria
- The OpenShift Technical Release process scales to the point that half of TRT time is spent on efforts that benefit more than just OCP
- Metrics would come from Activity Types and GitHub data. This is likely dependent on successfully partnering with OCP to share CI watcher duties.
- Sustaining OpenShift Component Readiness triage and remediation becomes a 50/50 split between OpenShift and SHIP developers, taking no more than 20% of TRT time.
- Our OpenShift nightly success rate is a reasonable indicator of how much time we're having to dedicate, especially for sprint releases (ECs) and release candidates (RCs). There are efforts within SHIP's control to increase nightly success to the point that the TRT rarely needs ad-hoc discussions before OpenShift ECs and RCs can be created.
- Less than 2% of engineer time under Tracy is spent in Konflux
- One criterion for selecting which products will onboard to the ART Konflux pipeline is how much time is currently lost to manual Konflux work. We plan to be able to report the time saved with each product onboarded. Near the start of Q2 2026 we should have a clear understanding of which teams want to onboard.
- All OpenShift maintenance streams ship Errata every week. All other products ship at whatever cadence their development team prefers.
- We'll target 3 out of 4 weeks meeting the OCP criteria by the end of Q2 2026.
- Is every team able to ship a fix within 24 hours if necessary?
- A single, well-known entry point for SHIP help, with support SLAs
- Our most interrupt-driven teams (ART, Test Platform, and CRT) already keep metrics on their tickets. We'll be able to report a reduction in users going straight to their channels for help.
- We plan to have the standard "user satisfaction" metrics for technical helpdesk:
- Accurate initial triage to the team that can ultimately solve the problem
- First Contact Resolution (especially helpful for gauging AI-assistance accuracy)
- Ticket Backlog
- Average Ticket Resolution Time
- Automation that quickly and accurately:
- Provides developers with access to the services they need
- We'll have metrics on time saved waiting for important things like new repository provisioning.
- Identifies the owner of each step in important workflows
- We'll show metrics on efficiencies gained by shifting QE verification and Component Readiness triage further left.
- Stays up to date
- You can consider this a variation of #2 above. The process for updating the organization data that powers this automation must be accurate. Since the data is backed by git, we'll be able to report metrics on how often it's updated. Given the rate of change at Red Hat, if churn in this repo ever flatlines we should assume the data has gone stale.
- And, most important for scale, doing all of the above without diminishing our services to OpenShift or customizing those services to the degree that they only work for OpenShift
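The "stays up to date" criterion above lends itself to a simple measurement. As a loose illustration only (the function names, the eight-week window, and the zero-commit staleness heuristic are hypothetical sketches, not anything SHIP has committed to), commit dates pulled from the org-data repo, e.g. via `git log --format=%cs`, could be bucketed into weekly counts and flagged when churn flatlines:

```python
from datetime import date

def weekly_commit_counts(commit_dates, end, weeks):
    """Bucket commit dates into trailing weekly counts ending at `end`.

    Returns a list where index 0 is the most recent week.
    """
    counts = [0] * weeks
    for d in commit_dates:
        age_weeks = (end - d).days // 7
        if 0 <= age_weeks < weeks:
            counts[age_weeks] += 1
    return counts

def looks_stale(commit_dates, end, weeks=8):
    """Heuristic: zero commits across the trailing window suggests the
    org data has stopped being maintained (hypothetical threshold)."""
    return sum(weekly_commit_counts(commit_dates, end, weeks)) == 0
```

In practice the dates would be parsed from `git log` output on a schedule, and the weekly counts charted so a flatline is visible before `looks_stale` ever fires.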
Expected Results (what, how, when)
The overwhelming majority of what SHIP does contributes to the Success Criteria mentioned above. If you subscribe to this Outcome you’ll be treated to a weekly update on our progress.
We’re clearly still organizing this Outcome in Jira, but the back-of-the-napkin estimates are:
- #1 will need half of 2026.
- We’re currently working with OCP on a joint CI SRE process that we believe will free up an entire engineer on our TRT. We should have this working smoothly enough to call it done by the end of Q1.
- By hiring an additional engineer (or two) and reclaiming one from the bullet point above, we expect to meet our goal of having the TRT support other large teams beyond OpenShift (think HCM and Virt).
- Having more TRT engineers on forward thinking projects instead of firefighting will allow us to improve the tooling that helps engineers to understand the CI signal. The key is making the tooling effective enough that it’s used throughout the entire development cycle (not just reluctantly a few frantic weeks before a big release).
- #2 will probably take us into early 2027, but it won’t seem that long since there will be a steady stream of components making their way onto the ART Konflux pipeline.
- #3 is well underway with pre-merge verification and Konflux automation. By Q3, everything on the ART pipeline will be shipping every week. In the meantime you will see:
- Our progress towards having 95% of tests live where their corresponding features live.
- Popular AI tools will even know how add-on components are tested on OpenShift. If there is any doubt, leaders will ask them where the tests are, what features are covered, and what the CI results mean.
- Confidence in test coverage will let us remove all human intervention for releases that pass their tests.
- #4 will start with a unified channel for support by the end of Q1 2026 with SRE-lite processes established by the end of Q3.
- #5a is becoming more and more of a reality with Cyborg taking over GitHub account management. Even parts of #5b are already happening, with automatic escalation of pre-merge verification delays. You can expect to see improvements that help pull requests land sooner by the end of Q2.