
Type: Epic
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Epic Name:
Distinguish update risks as MachineConfig-scoped or not
Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected
Epic Status:
To Do

Epic Goal*

Distinguish update risks as MachineConfig-scoped or not, so that folks can figure out whether a MachineConfig bump (i.e. new kubelet/CRI-O/RHCOS) is risky, whether a control-plane-components bump is risky, or both.

Why is this important? (mandatory)

HyperShift allows distinct HostedCluster and NodePool updates, which is focusing UX effort on a distinction that has been possible in stand-alone clusters for a while via paused MachineConfigPools. Issues like ARM64SecCompError524 only occur on MachineConfig updates that bring in a vulnerable RHCOS, so folks could freely update their control-plane components without exposing themselves to that risk. StaleInsightsRunLevelLabel, on the other hand, only affects the control plane, so folks could freely update their MachineConfig components without exposing themselves to that risk.

OTA-267 added the "which types of clusters?" scoping to update risks, and that has helped reduce the fleet impact of newly discovered risks by encouraging unaffected clusters to continue to update. This epic extends that effort to "which parts of those clusters?", which helps reduce the fleet impact of newly discovered risks by encouraging unaffected cluster components to continue to update.

One aspect of the benefit is that we have little control over, and long lead times for, fixes to RHEL-side issues like ARM64SecCompError524, and it would help managed clusters a lot to be able to ignore those risks when updating the HostedCluster bits we run for customers.

Scenarios (mandatory)

HyperShift systems are able to update their control plane components (e.g. HostedCluster) regardless of MachineConfig-side risks in their target release.

HyperShift systems are able to update their MachineConfig components (e.g. NodePools) regardless of control-plane-side risks in their target release.
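
The two scenarios above could be sketched roughly as follows. This is only an illustration, not a proposed schema: the Scope enum, the Risk dataclass, and the risks_for helper are all hypothetical names, and the scope assignments simply follow the two example risks discussed in this epic.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Hypothetical scope labels; this epic does not define a schema."""
    CONTROL_PLANE = "ControlPlane"
    MACHINE_CONFIG = "MachineConfig"


@dataclass(frozen=True)
class Risk:
    """Hypothetical risk record: a name plus the scopes it applies to."""
    name: str
    scopes: frozenset


def risks_for(risks, component):
    """Return the names of risks that apply to an update of `component`."""
    return [risk.name for risk in risks if component in risk.scopes]


# Scope assignments follow the two examples given in this epic.
target_release_risks = [
    Risk("ARM64SecCompError524", frozenset({Scope.MACHINE_CONFIG})),
    Risk("StaleInsightsRunLevelLabel", frozenset({Scope.CONTROL_PLANE})),
]

# A HostedCluster update only needs to weigh control-plane risks,
# and a NodePool update only needs to weigh MachineConfig risks.
hosted_cluster_risks = risks_for(target_release_risks, Scope.CONTROL_PLANE)
node_pool_risks = risks_for(target_release_risks, Scope.MACHINE_CONFIG)
print(hosted_cluster_risks)
print(node_pool_risks)
```

Modeling scopes as a set rather than a single value leaves room for a risk that affects both sides, which matches the "or both" case in the epic goal.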

Dependencies (internal and external) (mandatory)

The OTA updates team can implement this unilaterally, but will require org-wide buy-in from impact-statement responders for the actual "which component?" data on each regression.

Various other Cincinnati/update-service clients will need to be trained to understand the distinction if they want to take advantage of its flexibility (MachineConfig operator? In-cluster web console? OCM? Labs graph browser? Etc.).

Contributing Teams (and contacts) (mandatory)

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

Development -
Documentation -
QE -
PX -
Others -

Acceptance Criteria (optional)

Both Scenarios listed above can be exercised by QE using dummy update-service data to confirm their functionality.
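
A rough idea of what the dummy update-service data and a check against it might look like. Everything here is an assumption for illustration: the payload is only loosely modeled on conditional-update data, the "scope" field does not exist today, the versions and risk names are made up, and applicable_risks is a hypothetical helper. The sketch also assumes that a risk without a scope should be treated as applying to both sides, so that existing unscoped risks keep blocking as they do now.

```python
import json

# Dummy update-service payload. The "scope" field is hypothetical, and the
# versions and risk names are placeholders, not real releases or risks.
DUMMY_GRAPH = json.loads("""
{
  "conditionalEdges": [
    {
      "edges": [{"from": "4.99.0", "to": "4.99.1"}],
      "risks": [
        {"name": "DummyNodeRisk", "scope": ["MachineConfig"]},
        {"name": "DummyControlPlaneRisk", "scope": ["ControlPlane"]},
        {"name": "DummyUnscopedRisk"}
      ]
    }
  ]
}
""")

ALL_SCOPES = ("ControlPlane", "MachineConfig")


def applicable_risks(graph, source, target, component):
    """Return names of risks on the source->target edge that apply to component."""
    names = []
    for conditional in graph.get("conditionalEdges", []):
        edges = conditional.get("edges", [])
        if not any(e["from"] == source and e["to"] == target for e in edges):
            continue
        for risk in conditional.get("risks", []):
            # Assumption: a risk with no scope applies to every component.
            if component in risk.get("scope", ALL_SCOPES):
                names.append(risk["name"])
    return names


print(applicable_risks(DUMMY_GRAPH, "4.99.0", "4.99.1", "ControlPlane"))
print(applicable_risks(DUMMY_GRAPH, "4.99.0", "4.99.1", "MachineConfig"))
```

With data like this, QE could confirm that a control-plane update ignores DummyNodeRisk, that a MachineConfig update ignores DummyControlPlaneRisk, and that both still see DummyUnscopedRisk.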

Drawbacks or Risk (optional)

The additional flexibility will certainly introduce more complexity. In stand-alone OpenShift, where the only compute-decoupling knob was pausing MachineConfigPools, this flexibility didn't seem like it was worth the implementation cost. Now that HostedCluster and NodePool are becoming more separable, it may be worth paying the implementation and maintenance costs.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

CI Testing - Tests are merged and completing successfully
Documentation - Content development is complete.
QE - Test scenarios are written and executed successfully.
Technical Enablement - Slides are complete (if requested by PLM)
Other

is blocked by

OTA-267 Add capability for targeted edge blocking

Closed

is related to

OTA-885 Spike to findout if we need upgrade recomendation for node pool

Closed

relates to

OTA-791 Hosted control planes (HyperShift) should consume recomended updates from OSUS

Closed
