
Type: Feature Request
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: API, Auth, Over the Air, Testing
Labels:
None

Proposed title of this feature request

Track skew related to a cluster's born-in version

What is the nature and description of the request?

Provide OCP CI periodics that track and fail on unexpected skew between clusters born in 4.y and clusters born in 4.(<y) and then updated to 4.y. Also provide a review process for triaging and responding to any drift that the CI detects, whether that's improving in-cluster component management to remove the skew, or granting an exception to allow the skew.

Why does the customer need this? (List the business requirements here)

We have occasional regressions due to cluster attributes that are configured at install-time and then not actively managed as the cluster updates. For example, a subset of the old-install update risks declared in cincinnati-graph-data:

$ git log -p -G born_by -U100 | grep '^[+]name: ' | sort | uniq
+name: AWSOldBootImages
+name: AWSOldBootImagesLackAfterburn
+name: AzureDefaultVMType
+name: CSRNotApprovedBadCerts
+name: EarlyAPICertRotation
+name: OldBootImagesPodmanMissingAuthFlag
+name: OVNKubeMasterDSPrestop
+name: ReleaseDataWithHyphenPrefix

For more details on each of those risks, see:

$ for RISK in AWSOldBootImages AWSOldBootImagesLackAfterburn AzureDefaultVMType CSRNotApprovedBadCerts EarlyAPICertRotation OldBootImagesPodmanMissingAuthFlag OVNKubeMasterDSPrestop ReleaseDataWithHyphenPrefix; do echo -n "${RISK} "; grep -h 'url:' $(git grep -l "name: ${RISK}\$") | sort | uniq; done
AWSOldBootImages url: https://issues.redhat.com/browse/COS-1942
AWSOldBootImagesLackAfterburn url: https://issues.redhat.com/browse/MCO-519
AzureDefaultVMType url: https://issues.redhat.com/browse/OCPCLOUD-2409
CSRNotApprovedBadCerts url: https://issues.redhat.com/browse/MCO-1091
EarlyAPICertRotation url: https://issues.redhat.com/browse/API-1687
OldBootImagesPodmanMissingAuthFlag url: https://issues.redhat.com/browse/MCO-540
OVNKubeMasterDSPrestop url: https://issues.redhat.com/browse/SDN-4196
ReleaseDataWithHyphenPrefix url: https://access.redhat.com/solutions/6965075

By tracking skew between releases, we can understand our exposure to those kinds of regressions, and make informed decisions about whether a given regression risk is worth fixing (by improving in-cluster controllers to manage that cluster attribute) or accepting (because fixing in-cluster control would be a significant lift). In some cases, like the old-boot-image regressions (~~COS-1942~~, ~~MCO-519~~, etc.), we were aware of the risk, and were eventually able to prioritize improving in-cluster control (RFE-817, ~~OCPSTRAT-98~~, ~~MCO-994~~). In other cases, like EarlyAPICertRotation (API-1687), the SecretTypeTLS vs. kubernetes.io/tls skew went undetected until long-lived staging clusters updated into the regression.

List any affected packages or components

deads@redhat.com suggested the test strategy of diffing must-gathers to detect skew between clusters born in 4.y and clusters born in 4.(<y) and then updated to 4.y. The bulk of the initial lift will be building that tool, and teaching it that:

metadata.resourceVersion is expected to diverge, because it depends on timing details, and is not relevant to most controller activity.
Secret type is important, and we want to hear about SecretTypeTLS vs. kubernetes.io/tls skew.
Secret values are not important. E.g. different clusters are expected to have different server certificates. And must-gather's redaction will limit what we have access to in the Secret-value space anyway.
Everything else that might differ between clusters, sorted into differences that matter and differences that do not. This will likely need domain experts from many parts of OpenShift to contribute their knowledge, to get the tool to the point where it could be handed off to TRT or the patch-manager or whoever for triage.
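
As a rough illustration of the rules above, here is a minimal Python sketch of how a must-gather differ might compare one resource from a born-in-4.y cluster against the same resource from an updated cluster. It is not an existing tool: the names IGNORED_PATHS, flatten, normalize, and diff_resources are hypothetical, resources are assumed to be already parsed into dicts, and a real ignore list would be far longer and maintained by domain experts.

```python
# Hypothetical ignore list: leaf paths expected to diverge between any two clusters.
IGNORED_PATHS = {
    ("metadata", "resourceVersion"),
}


def flatten(obj, prefix=()):
    """Yield (path, value) for every leaf of a nested dict; lists count as leaves."""
    if isinstance(obj, dict) and obj:
        for key in sorted(obj):
            yield from flatten(obj[key], prefix + (key,))
    else:
        yield prefix, obj


def normalize(resource):
    """Map leaf paths to values, dropping or masking uninteresting differences."""
    leaves = {}
    for path, value in flatten(resource):
        if path in IGNORED_PATHS:
            continue
        # Secret values differ per cluster (and must-gather redacts them),
        # so only the presence of each key is compared, not its content.
        if resource.get("kind") == "Secret" and path[:1] in (("data",), ("stringData",)):
            value = "<ignored>"
        leaves[path] = value
    return leaves


def diff_resources(born, updated):
    """Return sorted (path, born_value, updated_value) for leaves that differ."""
    a, b = normalize(born), normalize(updated)
    return [
        (".".join(path), a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    ]
```

Given two Secrets that differ in resourceVersion, certificate data, and type, this sketch reports only the type difference, which is the SecretTypeTLS vs. kubernetes.io/tls kind of skew we want to hear about. Joining path segments with "." is ambiguous for keys that themselves contain dots (such as tls.crt), so a real tool would want a more careful path format.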

David had initially floated mfojtik@redhat.com for triage/priority on this must-gather differ. More recently, he floated the updates team. This new RFE is an attempt to get the feature as a whole into the Unified Backlog input queue, so folks can discuss whether I'm capturing the ask clearly, which teams should be involved in filling out the initial differ tool, and where that work fits in vs. the other priorities those teams have.

If/when the must-gather differ is in place, folks would need to decide how much CI/triage capacity to invest in running it. Would you cover just 4.(dev-1) to 4.dev? Also 4.(dev-2) to 4.(dev-1) to 4.dev? How much skew for updates into 4.dev? Would you also cover chains ending in released 4.y, or just 4.dev? Would you cover various clouds (AWS vs. Azure vs. ...)? TechPreviewNoUpgrade or just GA? Standalone and hosted/HyperShift, or just one? All the usual CI-combinatorics questions. It could probably be bolted into the origin test-suite or some of the step-registry's update workflows. But HyperShift doesn't use either of those for testing updates, so again, it would be good to have a coverage plan that was distinct from the coverage implementation selected to deliver (at least) that plan.
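
To make the combinatorics concrete, a coverage plan could be written down as a set of axes and expanded mechanically, independent of whichever CI implementation ends up running the jobs. The following is only a sketch: the axis names and values in AXES are illustrative assumptions drawn from the questions above, not an agreed plan, and coverage_matrix is a hypothetical helper.

```python
from itertools import product

# Hypothetical axes; each value list is an example, not a decided scope.
AXES = {
    "update_chain": ["4.(dev-1) -> 4.dev", "4.(dev-2) -> 4.(dev-1) -> 4.dev"],
    "platform": ["aws", "azure"],
    "feature_set": ["Default", "TechPreviewNoUpgrade"],
    "topology": ["standalone", "hypershift"],
}


def coverage_matrix(axes):
    """Expand the axes into one dict per job combination."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]
```

Even with two values per axis this yields 16 jobs, which is why the capacity question needs an explicit answer before choosing where to bolt the differ in.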

is related to

OCPSTRAT-714 Provide Detailed Administrative Control of all OCP Certs and Keys

In Progress
