Feature
Resolution: Unresolved
Feature Overview
We want to significantly increase the likelihood of having an accepted nightly build every night, so that OpenShift code is integrated more quickly into downstream processes such as Component Readiness and partner testing.
Goals
- Remove much of the noise from the CI signal that contributes to the high number of nightly failures.
- Reduce the time from PR merge to accepted nightly to a fraction of what it is today (several times faster).
- Get ECs and RCs "for free" as a by-product of our normal nightly CI process.
- Establish a process that forces us to get slightly better whenever we fail (think mini-retro within SHIP).
Requirements
- We cannot reduce the effectiveness of Component Readiness in any significant way.
- The end result must be less work for SHIP engineers.
Questions to Answer (Optional):
Out of Scope
- My initial thought is to only consider changes that we have direct control over in SHIP. There is a fuzzy line, however: we have done many things in SHIP that have improved product code quality.
Background
Take a moment to browse the amd64 payload stream for the development version of OpenShift. Here's the link to 4.21. At the time of this writing we haven't had an accepted nightly in 6 days.
A reasonable person would panic if they saw this data. However, most people don't know the higher-level process we use to ensure that the work of hundreds of engineers continues to make OpenShift better every release. The reality is that the engineering we've put into Component Readiness means we haven't needed to rely on nightly builds as our primary signal for quality.
That said, there are several downstream processes that rely on nightly quality.
- Accepted nightlies set the new baseline for periodics. This is the data that feeds Component Readiness.
- Every sprint we select a payload for our Engineering Candidate. This is a public release that partners and other dependent teams rely on.
- By upgrading a portion of our own critical infrastructure to Engineering Candidates, we've caught no shortage of bugs that slipped through every other form of testing we have.
- After branching we generate release candidates from accepted nightlies.
Going long periods of time without accepted nightlies generates ad-hoc communication throughout SHIP.
- Without accepted nightlies in the last week of the sprint, the TRT has to scramble to find some way to manually accept a nightly. This happens often and usually involves 5+ people talking for ~30 minutes, plus an engineer or two going off on a special mission to debug an urgent but not important problem.
- One reason it's urgent is that, without ECs, Product Managers come to ART asking which payload contains a feature they want to preview and if it could be made public.
Ideas (in no particular order):
- Dramatically increasing nightly build frequency to give us more chances for success
- Removing low value testing from nightlies
- Moving more testing to CI payloads (could improve Component Readiness speed)
- Having a convenient “retry all tests for this payload” option
- Having more advanced ways to revert to the Last Known Good point in time, giving teams the freedom to fix the problem on their own schedule.
- Keep identifying and eliminating sources of instability in our CI platform. We have better data than ever for doing this.
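To make the first idea in this list (more frequent nightly builds) concrete, here is a rough back-of-the-envelope sketch in Python. It assumes each build is accepted independently with the same probability, which is a strong simplification: in practice failures are often correlated, since one regression or one CI platform outage can reject several payloads in a row. The function name and the 20% acceptance rate are hypothetical, chosen only for illustration and not taken from real payload data.

```python
def chance_of_accepted_payload(per_build_acceptance: float, builds_per_day: int) -> float:
    """Probability that at least one build in a day is accepted.

    Assumption (illustrative only): every build is accepted independently
    with the same probability. Correlated failures would make the real
    number lower than this estimate.
    """
    if not 0.0 <= per_build_acceptance <= 1.0:
        raise ValueError("per_build_acceptance must be between 0 and 1")
    if builds_per_day < 0:
        raise ValueError("builds_per_day must not be negative")
    # P(at least one accepted) = 1 - P(every build rejected)
    return 1.0 - (1.0 - per_build_acceptance) ** builds_per_day


# Hypothetical per-build acceptance rate; not measured from any payload stream.
ASSUMED_ACCEPTANCE_RATE = 0.2

for builds in (1, 2, 4, 8):
    chance = chance_of_accepted_payload(ASSUMED_ACCEPTANCE_RATE, builds)
    print(f"{builds} build(s) per day: {chance:.1%} chance of an accepted payload")
```

Under these assumptions, extra builds help most when the per-build acceptance rate is low but not near zero. If rejections are driven by a single persistent regression, more builds alone add little, which is why the other ideas (removing noise, reverting to Last Known Good) still matter.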