  1. OpenShift Over the Air
  2. OTA-849

Distinguish update risks as MachineConfig-scoped or not

    • Epic
    • Resolution: Unresolved
    • To Do

      Epic Goal*

      Distinguish update risks as MachineConfig-scoped or not, so that folks can figure out whether a MachineConfig bump (i.e. new kubelet/CRI-O/RHCOS) is risky, whether a control-plane-components bump is risky, or both.

      Why is this important? (mandatory)

      HyperShift allows distinct HostedCluster and NodePool updates, which is focusing UX effort on a distinction that has been possible in stand-alone clusters for a while via paused MachineConfigPools. Issues like ARM64SecCompError524 only occur on MachineConfig updates that bring in a vulnerable RHCOS, and folks could freely update their control-plane components without exposing themselves to that risk. StaleInsightsRunLevelLabel, on the other hand, only affects the control plane, and folks could freely update their MachineConfig components without exposing themselves to that risk.

      OTA-267 added the "which types of clusters?" scoping to update risks, and that has helped reduce the fleet-impact of newly-discovered risks by encouraging unaffected clusters to continue to update. This epic extends that effort to "which parts of those clusters?", which should help reduce the fleet-impact of newly-discovered risks by encouraging unaffected cluster-components to continue to update.

      One aspect of the benefit is that we have little control over, and long lead times for, fixes to RHEL-side issues like ARM64SecCompError524, and it would help managed clusters a lot to be able to ignore those risks when updating the HostedCluster bits we run for customers.

      Scenarios (mandatory) 

      HyperShift systems are able to update their control plane components (e.g. HostedCluster) regardless of MachineConfig-side risks in their target release.

      HyperShift systems are able to update their MachineConfig components (e.g. NodePools) regardless of control-plane-side risks in their target release.
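
      The two scenarios above can be sketched as a filter over scoped risks. This is only an illustration, not the update service's or HyperShift's actual data model: the `Scope` enum, the `Risk` class, and the `risks_for` helper are hypothetical names, and the scoping of the two risks follows the examples given in this epic.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Hypothetical labels for which part of a cluster a risk affects."""
    CONTROL_PLANE = "control-plane"
    MACHINE_CONFIG = "machine-config"


@dataclass(frozen=True)
class Risk:
    """Hypothetical scoped risk; a risk may affect one part or both."""
    name: str
    scopes: frozenset


def risks_for(component, risks):
    """Return the names of risks that apply when updating only `component`."""
    return [risk.name for risk in risks if component in risk.scopes]


# Scoping taken from the two examples discussed in this epic.
RISKS = [
    Risk("ARM64SecCompError524", frozenset({Scope.MACHINE_CONFIG})),
    Risk("StaleInsightsRunLevelLabel", frozenset({Scope.CONTROL_PLANE})),
]

# A HostedCluster (control-plane) update only needs to weigh control-plane risks.
hosted_cluster_risks = risks_for(Scope.CONTROL_PLANE, RISKS)
# A NodePool (MachineConfig) update only needs to weigh MachineConfig risks.
node_pool_risks = risks_for(Scope.MACHINE_CONFIG, RISKS)

print(hosted_cluster_risks)
print(node_pool_risks)
```

      Under these assumptions, a HostedCluster update would not be held back by ARM64SecCompError524, and a NodePool update would not be held back by StaleInsightsRunLevelLabel.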

      Dependencies (internal and external) (mandatory)

      The OTA updates team can implement this unilaterally, but will require org-wide buy-in from impact-statement responders for the actual "which component?" data on each regression.

      Various other Cincinnati/update-service clients will need to be trained to understand the distinction if they want to take advantage of its flexibility (MachineConfig operator? In-cluster web console? OCM? Labs graph browser? Etc.).

      Contributing Teams (and contacts) (mandatory)

      Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

      • Development - 
      • Documentation -
      • QE - 
      • PX - 
      • Others -

      Acceptance Criteria (optional)

      Both Scenarios listed above can be exercised by QE using dummy update-service data to confirm their functionality.
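
      As a rough idea of what such dummy data might look like, the sketch below loosely models a conditional-edge entry with per-risk scoping. The `scope` key and its values are hypothetical and not an existing update-service field; the versions, risk names, and URLs are placeholders; and treating an unscoped risk as applying to every component is an assumed conservative default, not a decided behavior.

```python
# Dummy data for illustration only. The "scope" key is a hypothetical
# extension, and all versions, names, and URLs are placeholders.
DUMMY_CONDITIONAL_EDGE = {
    "edges": [{"from": "4.0.0", "to": "4.0.1"}],
    "risks": [
        {
            "name": "DummyNodeRisk",
            "url": "https://example.com/dummy-node-risk",
            "message": "Dummy risk that only affects MachineConfig components.",
            "matchingRules": [{"type": "Always"}],
            "scope": ["machine-config"],
        },
        {
            "name": "DummyControlPlaneRisk",
            "url": "https://example.com/dummy-control-plane-risk",
            "message": "Dummy risk that only affects control-plane components.",
            "matchingRules": [{"type": "Always"}],
            "scope": ["control-plane"],
        },
        {
            # No "scope" key: assumed here to apply to every component.
            "name": "DummyUnscopedRisk",
            "url": "https://example.com/dummy-unscoped-risk",
            "message": "Dummy risk with no scoping information.",
            "matchingRules": [{"type": "Always"}],
        },
    ],
}

ALL_SCOPES = ("control-plane", "machine-config")


def blocking_risks(conditional_edge, component):
    """Names of risks relevant to updating `component` along this edge.

    Risks without a "scope" key are treated as applying to all components.
    """
    return [
        risk["name"]
        for risk in conditional_edge["risks"]
        if component in risk.get("scope", ALL_SCOPES)
    ]


print(blocking_risks(DUMMY_CONDITIONAL_EDGE, "control-plane"))
print(blocking_risks(DUMMY_CONDITIONAL_EDGE, "machine-config"))
```

      A QE check for each scenario could then assert that the dummy node risk never appears for a control-plane update, and that the dummy control-plane risk never appears for a MachineConfig update.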

      Drawbacks or Risk (optional)

      The additional flexibility will certainly introduce more complexity. In stand-alone OpenShift, where the only compute-decoupling knob was pausing MachineConfigPools, this flexibility didn't seem like it was worth the implementation cost. Now that HostedCluster and NodePool are becoming more separable, it may be worth paying the implementation and maintenance costs.

      Done - Checklist (mandatory)

      The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

      • CI Testing - Tests are merged and completing successfully
      • Documentation - Content development is complete.
      • QE - Test scenarios are written and executed successfully.
      • Technical Enablement - Slides are complete (if requested by PLM)
      • Other 

            Unassigned Unassigned
            trking W. Trevor King