-
Feature Request
-
Resolution: Done
-
Major
Proposed title of this feature request
Track skew related to a cluster's born-in version
What is the nature and description of the request?
Provide OCP CI periodics that track and fail on unexpected skew between clusters born in 4.y and clusters born in 4.(<y) and then updated to 4.y. Also provide a review process for triaging and responding to any drift that the CI detects, whether that's improving in-cluster component management to remove the skew, or granting an exception to allow the skew.
Why does the customer need this? (List the business requirements here)
We have occasional regressions due to cluster attributes that are configured at install-time and then not actively managed as the cluster updates. For example, a subset of the old-install update risks declared in cincinnati-graph-data:
$ git log -p -G born_by -U100 | grep '^[+]name: ' | sort | uniq
+name: AWSOldBootImages
+name: AWSOldBootImagesLackAfterburn
+name: AzureDefaultVMType
+name: CSRNotApprovedBadCerts
+name: EarlyAPICertRotation
+name: OldBootImagesPodmanMissingAuthFlag
+name: OVNKubeMasterDSPrestop
+name: ReleaseDataWithHyphenPrefix
For more details on each of those risks, see:
$ for RISK in AWSOldBootImages AWSOldBootImagesLackAfterburn AzureDefaultVMType CSRNotApprovedBadCerts EarlyAPICertRotation OldBootImagesPodmanMissingAuthFlag OVNKubeMasterDSPrestop ReleaseDataWithHyphenPrefix; do echo -n "${RISK} "; grep -h 'url:' $(git grep -l "name: ${RISK}\$") | sort | uniq; done
AWSOldBootImages url: https://issues.redhat.com/browse/COS-1942
AWSOldBootImagesLackAfterburn url: https://issues.redhat.com/browse/MCO-519
AzureDefaultVMType url: https://issues.redhat.com/browse/OCPCLOUD-2409
CSRNotApprovedBadCerts url: https://issues.redhat.com/browse/MCO-1091
EarlyAPICertRotation url: https://issues.redhat.com/browse/API-1687
OldBootImagesPodmanMissingAuthFlag url: https://issues.redhat.com/browse/MCO-540
OVNKubeMasterDSPrestop url: https://issues.redhat.com/browse/SDN-4196
ReleaseDataWithHyphenPrefix url: https://access.redhat.com/solutions/6965075
By tracking skew between releases, we can understand our exposure to those kinds of regressions, and make informed decisions about when the regression risk is worth fixing (by improving in-cluster controllers to manage that cluster attribute) or accepting (because fixing in-cluster control would be a significant lift). In some cases, like the old-boot-image regressions (COS-1942, MCO-519, etc.), we were aware of the risk, and eventually able to prioritize improving in-cluster control (RFE-817, OCPSTRAT-98, MCO-994). In other cases, like EarlyAPICertRotation (API-1687), the SecretTypeTLS vs. kubernetes.io/tls skew went undetected until long-lived staging clusters updated into the regression.
List any affected packages or components
deads@redhat.com suggested the test strategy of diffing must-gathers to detect skew between clusters born in 4.y and clusters born in 4.(<y) and then updated to 4.y. The bulk of the initial lift will be building that tool, and teaching it that:
- metadata.resourceVersion is expected to diverge, because it depends on timing details, and is not relevant to most controller activity.
- Secret type is important, and we want to hear about SecretTypeTLS vs. kubernetes.io/tls skew.
- Secret values are not important. E.g. different clusters are expected to have different server certificates. And must-gather's redaction will limit what we have access to in the Secret-value space anyway.
- All the other things that might differ, sorted into differences that matter and differences that do not. Likely there will be a need for domain experts from many parts of OpenShift contributing this knowledge, to get it to the point that it could be handed off to TRT or the patch-manager or whoever for triage.
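As a rough illustration of the rules in the list above, here is a minimal Python sketch of a per-resource differ. It is not the proposed tool: the names diff_resources, is_ignored, flatten, and IGNORED_PATHS are hypothetical, the input is assumed to be resources already parsed from two must-gathers into dicts, and a real rule set would come from the domain experts mentioned above.

```python
from typing import Any, Dict, Iterator, Tuple

Path = Tuple[str, ...]

# Hypothetical ignore rules; a real list would be maintained by domain experts.
IGNORED_PATHS = {("metadata", "resourceVersion")}
MISSING = "<missing>"  # placeholder for a field present on only one side


def flatten(obj: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, leaf value) pairs for a nested dict/list structure."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from flatten(obj[key], path + (str(key),))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from flatten(item, path + (str(index),))
    else:
        yield path, obj


def is_ignored(kind: str, path: Path) -> bool:
    """Return True for fields that are expected to diverge between clusters."""
    if path in IGNORED_PATHS:
        return True
    # Secret values are expected to differ; the Secret type is still compared.
    if kind == "Secret" and path[:1] in (("data",), ("stringData",)):
        return True
    return False


def diff_resources(born: Dict[str, Any], updated: Dict[str, Any]) -> Dict[Path, Tuple[Any, Any]]:
    """Map each differing, non-ignored field path to (born value, updated value)."""
    kind = born.get("kind", "")
    born_leaves = {p: v for p, v in flatten(born) if not is_ignored(kind, p)}
    updated_leaves = {p: v for p, v in flatten(updated) if not is_ignored(kind, p)}
    skew = {}
    for path in sorted(set(born_leaves) | set(updated_leaves)):
        left = born_leaves.get(path, MISSING)
        right = updated_leaves.get(path, MISSING)
        if left != right:
            skew[path] = (left, right)
    return skew
```

Given two Secrets that differ in metadata.resourceVersion, data, and type, this sketch would report only the type field, which is the kind of skew the EarlyAPICertRotation case needed surfaced.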
David had initially floated mfojtik@redhat.com for triage/priority on this must-gather differ. More recently, he floated the updates team. This new RFE is an attempt to get the feature as a whole into the Unified Backlog input queue, so folks can discuss whether I'm capturing the ask clearly, which teams should be involved in filling out the initial differ tool, and where that work fits in vs. the other priorities those teams have.
If/when the must-gather differ is in place, folks would need to decide how much CI/triage capacity to invest in running it. Would you cover just 4.(dev-1) to 4.dev? Also 4.(dev-2) to 4.(dev-1) to 4.dev? How much skew for updates into 4.dev? Would you also cover chains ending in released 4.y, or just 4.dev? Would you cover various clouds (AWS vs. Azure vs. ...)? TechPreviewNoUpgrade or just GA? Standalone and hosted/HyperShift, or just one? All the usual CI-combinatorics questions. It could probably be bolted into the origin test-suite or some of the step-registry's update workflows. But HyperShift doesn't use either of those for testing updates, so again, it would be good to have a coverage plan that was distinct from the coverage implementation selected to deliver (at least) that plan.
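To make the combinatorics concrete, the following Python sketch enumerates one possible coverage matrix from the axes named above. The axis values and the coverage_matrix helper are illustrative assumptions, not a proposed plan; the point is only that each added axis multiplies the number of periodics to run and triage.

```python
from itertools import product

# Illustrative axis values only; the real plan is the open question above.
UPDATE_CHAINS = [
    ("4.(dev-1)", "4.dev"),
    ("4.(dev-2)", "4.(dev-1)", "4.dev"),
]
PLATFORMS = ["aws", "azure"]
FEATURE_SETS = ["GA", "TechPreviewNoUpgrade"]
TOPOLOGIES = ["standalone", "hypershift"]


def coverage_matrix(chains, platforms, feature_sets, topologies):
    """Return one job description per combination of the given axes."""
    return [
        {
            "chain": chain,
            "platform": platform,
            "feature_set": feature_set,
            "topology": topology,
        }
        for chain, platform, feature_set, topology in product(
            chains, platforms, feature_sets, topologies
        )
    ]


MATRIX = coverage_matrix(UPDATE_CHAINS, PLATFORMS, FEATURE_SETS, TOPOLOGIES)
```

With two values on each of the four axes, this yields 16 candidate jobs; trimming an axis to a single value halves that count, which is one way to reason about where CI and triage capacity is best spent.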
- is related to OCPSTRAT-714 Provide Detailed Administrative Control of all OCP Certs and Keys (In Progress)