OpenShift Request For Enhancement: RFE-5494

Track skew related to a cluster's born-in version

    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • API, Auth, Over the Air, Testing

      Proposed title of this feature request

      Track skew related to a cluster's born-in version

      What is the nature and description of the request?

      Provide OCP CI periodics that track and fail on unexpected skew between clusters born in 4.y and clusters born in 4.(<y) and then updated to 4.y. Also provide a review process for triaging and responding to any drift that the CI detects, whether that's improving in-cluster component management to remove the skew, or granting an exception to allow the skew.
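
      As a rough illustration of what "born in" means for such a periodic, the sketch below reads a cluster's born-in and current versions from a ClusterVersion object. It assumes status.history is ordered newest first and has not been truncated, so the last entry is the install version. The function names and the sample versions are hypothetical.

```python
from __future__ import annotations


def born_in_and_current(cluster_version: dict) -> tuple[str, str]:
    """Return (oldest recorded version, newest recorded version).

    Assumption: status.history is ordered newest first and still contains
    the original install entry.
    """
    history = cluster_version.get("status", {}).get("history", [])
    if not history:
        raise ValueError("ClusterVersion has no update history")
    return history[-1]["version"], history[0]["version"]


def minor_of(version: str) -> tuple[int, int]:
    """Reduce a version like '4.14.3' to its (major, minor) pair."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def updated_across_minors(cluster_version: dict) -> bool:
    """True if the cluster was born in an older 4.y than it runs now."""
    born, current = born_in_and_current(cluster_version)
    return minor_of(born) < minor_of(current)


# Hypothetical sample: a cluster installed as 4.12.0 and updated to 4.14.3.
sample = {
    "status": {
        "history": [
            {"version": "4.14.3", "state": "Completed"},
            {"version": "4.13.9", "state": "Completed"},
            {"version": "4.12.0", "state": "Completed"},
        ]
    }
}

print(born_in_and_current(sample))
print(updated_across_minors(sample))
```

      A periodic could use a check like this to confirm that the "updated" cluster in a comparison really was born in an earlier minor version than the freshly installed one.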

      Why does the customer need this? (List the business requirements here)

      We have occasional regressions due to cluster attributes that are configured at install-time and then not actively managed as the cluster updates. For example, a subset of the old-install update risks declared in cincinnati-graph-data:

      $ git log -p -G born_by -U100 | grep '^[+]name: ' | sort | uniq
      +name: AWSOldBootImages
      +name: AWSOldBootImagesLackAfterburn
      +name: AzureDefaultVMType
      +name: CSRNotApprovedBadCerts
      +name: EarlyAPICertRotation
      +name: OldBootImagesPodmanMissingAuthFlag
      +name: OVNKubeMasterDSPrestop
      +name: ReleaseDataWithHyphenPrefix
      

      For more details on each of those risks, see:

      $ for RISK in AWSOldBootImages AWSOldBootImagesLackAfterburn AzureDefaultVMType CSRNotApprovedBadCerts EarlyAPICertRotation OldBootImagesPodmanMissingAuthFlag OVNKubeMasterDSPrestop ReleaseDataWithHyphenPrefix; do echo -n "${RISK} "; grep -h 'url:' $(git grep -l "name: ${RISK}\$") | sort | uniq; done
      AWSOldBootImages url: https://issues.redhat.com/browse/COS-1942
      AWSOldBootImagesLackAfterburn url: https://issues.redhat.com/browse/MCO-519
      AzureDefaultVMType url: https://issues.redhat.com/browse/OCPCLOUD-2409
      CSRNotApprovedBadCerts url: https://issues.redhat.com/browse/MCO-1091
      EarlyAPICertRotation url: https://issues.redhat.com/browse/API-1687
      OldBootImagesPodmanMissingAuthFlag url: https://issues.redhat.com/browse/MCO-540
      OVNKubeMasterDSPrestop url: https://issues.redhat.com/browse/SDN-4196
      ReleaseDataWithHyphenPrefix url: https://access.redhat.com/solutions/6965075
      

      By tracking skew between releases, we can understand our exposure to those kinds of regressions, and make informed decisions about when the regression risk is worth fixing (by improving in-cluster controllers to manage that cluster attribute) or accepting (because fixing in-cluster control would be a significant lift). In some cases, like the old-boot-image regressions (COS-1942, MCO-519, etc.), we were aware of the risk, and eventually able to prioritize improving in-cluster control (RFE-817, OCPSTRAT-98, MCO-994). In other cases, like EarlyAPICertRotation (API-1687), the SecretTypeTLS vs. kubernetes.io/tls skew went undetected until long-lived staging clusters updated into the regression.

      List any affected packages or components

      deads@redhat.com suggested the test strategy of diffing must-gathers to detect skew between clusters born in 4.y and clusters born in 4.(<y) and then updated to 4.y. The bulk of the initial lift will be building that tool, and teaching it that:

      • metadata.resourceVersion is expected to diverge, because it depends on timing details, and is not relevant to most controller activity.
      • Secret type is important, and we want to hear about SecretTypeTLS vs. kubernetes.io/tls skew.
      • Secret values are not important. E.g. different clusters are expected to have different server certificates. And must-gather's redaction will limit what we have access to in the Secret-value space anyway.
      • All the other differences need classifying as either mattering or not mattering. Domain experts from many parts of OpenShift will likely need to contribute this knowledge before the tool could be handed off to TRT, the patch-manager, or whoever else ends up doing triage.
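
      The rules above can be sketched as a small differ over two copies of the same resource, one from each must-gather. This is a minimal sketch, not the proposed tool: the rule tables, function names, and sample Secret are hypothetical, and which cluster carries which Secret type is illustrative only.

```python
# Paths ignored for every kind: expected to diverge and not interesting.
IGNORED_PATHS = {("metadata", "resourceVersion")}

# Per-kind ignores: Secret values differ between clusters by design.
IGNORED_BY_KIND = {"Secret": {("data",), ("stringData",)}}

MISSING = "<missing>"  # marker for a path present on only one side


def flatten(obj, prefix=()):
    """Yield (path, leaf_value) pairs for nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, prefix + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, prefix + (index,))
    else:
        yield prefix, obj


def is_ignored(kind, path):
    """True if path starts with any ignored prefix for this kind."""
    prefixes = IGNORED_PATHS | IGNORED_BY_KIND.get(kind, set())
    return any(path[: len(prefix)] == prefix for prefix in prefixes)


def diff_resources(fresh, updated):
    """Return {path: (fresh_value, updated_value)} for unexpected skew."""
    kind = fresh.get("kind", "")
    a = {p: v for p, v in flatten(fresh) if not is_ignored(kind, p)}
    b = {p: v for p, v in flatten(updated) if not is_ignored(kind, p)}
    return {
        path: (a.get(path, MISSING), b.get(path, MISSING))
        for path in a.keys() | b.keys()
        if a.get(path, MISSING) != b.get(path, MISSING)
    }


# Hypothetical Secret as seen on a freshly installed cluster...
fresh = {
    "kind": "Secret",
    "type": "kubernetes.io/tls",
    "metadata": {"name": "example-cert", "resourceVersion": "100"},
    "data": {"tls.crt": "AAA"},
}
# ...and on a cluster born earlier and then updated.
updated = {
    "kind": "Secret",
    "type": "SecretTypeTLS",
    "metadata": {"name": "example-cert", "resourceVersion": "98765"},
    "data": {"tls.crt": "BBB"},
}

skew = diff_resources(fresh, updated)
print(skew)
```

      Here the resourceVersion and certificate differences are suppressed, and only the type skew is reported. Keeping the ignore rules in data rather than code is one way to let domain experts contribute exceptions without touching the comparison logic.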

      David had initially floated mfojtik@redhat.com for triage/priority on this must-gather differ. More recently, he floated the updates team. This new RFE is an attempt to get the feature as a whole into the Unified Backlog input queue, so folks can discuss whether I'm capturing the ask clearly, which teams should be involved in filling out the initial differ tool, and where that work fits in vs. the other priorities those teams have.

      If/when the must-gather differ is in place, folks would need to decide how much CI/triage capacity to invest in running it. Would you cover just 4.(dev-1) to 4.dev? Also 4.(dev-2) to 4.(dev-1) to 4.dev? How much skew for updates into 4.dev? Would you also cover chains ending in released 4.y, or just 4.dev? Would you cover various clouds (AWS vs. Azure vs. ...)? TechPreviewNoUpgrade or just GA? Standalone and hosted/HyperShift, or just one? All the usual CI-combinatorics questions. It could probably be bolted into the origin test-suite or some of the step-registry's update workflows. But HyperShift doesn't use either of those for testing updates, so again, it would be good to have a coverage plan that was distinct from the coverage implementation selected to deliver (at least) that plan.
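
      To make the combinatorics concrete, the sketch below enumerates one possible coverage matrix from the dimensions raised above. The dimension values are placeholders taken from those questions, not a proposed plan, and the names are hypothetical.

```python
from itertools import product

# Placeholder dimension values; a real coverage plan would choose its own.
UPDATE_CHAINS = [
    ("4.(dev-1)", "4.dev"),
    ("4.(dev-2)", "4.(dev-1)", "4.dev"),
]
PLATFORMS = ["AWS", "Azure"]
FEATURE_SETS = ["GA", "TechPreviewNoUpgrade"]
TOPOLOGIES = ["standalone", "hosted"]


def coverage_matrix():
    """Return one dict per candidate CI job in the full cross product."""
    return [
        {
            "chain": " -> ".join(chain),
            "platform": platform,
            "feature_set": feature_set,
            "topology": topology,
        }
        for chain, platform, feature_set, topology in product(
            UPDATE_CHAINS, PLATFORMS, FEATURE_SETS, TOPOLOGIES
        )
    ]


matrix = coverage_matrix()
print(len(matrix))  # 2 * 2 * 2 * 2 = 16 candidate jobs
for job in matrix[:3]:
    print(job)
```

      Even with two values per dimension the full cross product is 16 jobs, which is why a coverage plan that deliberately picks a subset is worth deciding separately from how the jobs are implemented.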

              wcabanba@redhat.com William Caban
              trking W. Trevor King