-
Epic
-
Resolution: Duplicate
-
Sev 3 Incident ITN-2025-00198 - Prevent future issues from PAC
-
To Do
-
-
Epic Goal
- RCA for Sev 3 Incident ITN-2025-00198
Why is this important?
Regarding what could have prevented this from happening and sped up the resolution:
PaC nightly builds switched from v0.36.x to v0.35.x, and it is still not clear why. Some hotfixes were going in at the time and might be involved, but the fix was for the git resolver, so it is unclear whether the hotfix was related or merely happened at the same time. Had this been understood, the version mismatch might have been prevented to begin with.
The PaC version was overridden in staging to a different minor version for almost a week. This is ultimately what caused the build-service change to be promoted to production. The override should ideally have been made on a different cluster, been reverted once testing was done, or used a custom build based on the same version as production. Ultimately, though, the staging override was only an issue (and possibly only necessary?) because the version had been downgraded in production.
PaC v0.35.3 was released over a month after the latest v0.36.x release and included many bug fixes. If the v0.36.x minor version had been up to date with the past month's bug fixes, the fix could have been rolled out much more swiftly. Instead, it took quite a long time to go back and forth and determine whether the month-old v0.36.0 was safe to deploy. The communication loop here was very slow because of the timezones and the day of learning. If bug fixes are applied to a minor version in a project, they should be applied to all actively maintained minor versions, and each of those minor versions should release a patch version. PaC v0.35.x having two patch releases while v0.36.x went a month without any of those patches is not good practice.
The OpenShift Operator lists which "versions" of each component are included, but this list is updated entirely separately from the components themselves. Nightly builds had listed PaC as being on v0.36.0 since June 27, even after the v0.35.3 update.
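The listed-versus-actual mismatch described in the last point is the kind of thing an automated check could catch. The following is a minimal sketch only, not the operator's actual tooling: the function name, the dictionary inputs, and the `pipelines-as-code` key are all hypothetical, and it assumes both the published version list and the versions found in the built components can be obtained as simple name-to-version mappings.

```python
from typing import Dict, List, Tuple


def find_version_drift(
    declared: Dict[str, str], actual: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (component, declared_version, actual_version) for every mismatch."""
    drift = []
    # Check every component that appears on either side, so a component
    # missing from one mapping is reported instead of silently skipped.
    for component in sorted(set(declared) | set(actual)):
        listed = declared.get(component, "<not listed>")
        built = actual.get(component, "<not built>")
        if listed != built:
            drift.append((component, listed, built))
    return drift


# Hypothetical inputs: the version list published with the nightly build,
# and the versions actually found in the built component images.
declared_versions = {"pipelines-as-code": "v0.36.0"}
actual_versions = {"pipelines-as-code": "v0.35.3"}

for component, listed, built in find_version_drift(declared_versions, actual_versions):
    print(f"{component}: listed as {listed}, but build contains {built}")
```

Run as a gate on the nightly build, a check along these lines would fail as soon as the list and the components diverge, rather than leaving the discrepancy to be found during an incident.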
Had there been an upstream PaC release which was obviously up-to-date and should have been deployed, we would still have issues in our build/release process:
The nightly build updates automatically, which is nice, but it also means that other, unrelated changes are pulled in. In this case those changes made the nightly-index unsuitable for release. This works okay for general releases but is not ideal for hotfixes: there is no easy way to build an index that is based on a previous index but includes only hand-picked changes.
Nightly build QE testing: the QE team's day ends just after my day begins. In the 2-3 hour overlap there is not much time to coordinate testing unless the tests can all be run concurrently, which was not the case here.
QE testing visibility: it's unclear whether non-QE team members can see the results of the automated QE tests for a given build. Given the above point about timezones, this makes it difficult to own incident resolution end to end.
The actual build process is quite convoluted. It's much more reliable than it was several months ago, and I have documented the steps I took to get this build out, but there are a lot of steps, and several of them are very manual. Chasing down the newly released image digest and checking the various pull requests and all their builds are two examples. I have more specific thoughts on this, which I'm going to discuss with the p12n team.
(see the timestamps regarding when the various builds were started, completed, and tested)
There is no internal documentation (yet) for the workaround required when the index is bad. Some of the necessary steps were unknown or unconfirmed, and the CCE team does not have permission to perform them. (Bug ticket to make these steps unnecessary: SRVKP-8332)
Scenarios
- ...
Acceptance Criteria (Mandatory)
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- ...
Previous Work (Optional):
- …
Open questions:
- …
Done Checklist
- Acceptance criteria are met
- Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
- User Journey automation is delivered
- Support and SRE teams are provided with enough skills to support the feature in production environment
- duplicates
-
SRVKP-8487 Nightly build automation improvements
-
- New
-