OpenShift Request For Enhancement · RFE-5403

Alternate profiles for ordering the worker nodes for cluster updates


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Critical
    • Components: Hosted Control Planes, MCO

      1. Proposed title of this feature request

      Alternate profiles for ordering the worker nodes for cluster updates

      2. What is the nature and description of the request?

      First, there is a certain order [1] in which the Machine Config Operator (MCO) picks worker nodes for cluster updates. Second, MCO has a MaxUnavailable setting [2] that allows customers to update multiple worker nodes within an MCP simultaneously. Third, draining of nodes is subject [3] to PodDisruptionBudgets (PDBs) and the Eviction API.
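      For context, a minimal sketch of how MaxUnavailable is set on a MachineConfigPool today; the pool name and value are illustrative, and maxUnavailable defaults to 1 when unset:

        # Illustrative MachineConfigPool snippet: maxUnavailable bounds how many
        # worker nodes the MCO may cordon, drain, and update at the same time.
        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfigPool
        metadata:
          name: worker
        spec:
          maxUnavailable: 2   # example value; an integer or a percentage such as "10%"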

      Customers want alternate profiles that allow them to complete cluster updates more quickly in cases where the combination of the above three mechanisms leaves the cluster update "stuck". By "stuck" we mean the update stalls because MCO either picks node(s) that cannot be drained successfully or picks an unsuitable number of nodes for simultaneous drain.

      3. Why does the customer need this? (List the business requirements here)

      Customers need this in ROSA HCP, where a) it is already possible to upgrade machine pools both independently and simultaneously, b) there is a separate machine config for each node pool, c) MaxUnavailable is available in the HyperShift API [4], and d) it is possible to create multiple node pools (see the NodePool sketch below).
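      A trimmed NodePool sketch showing where the rolling-update knobs live in the HyperShift API; the pool name, namespace, and values are illustrative, required fields such as platform and release are omitted, and the exact field paths should be checked against [4]:

        # Illustrative hypershift.openshift.io/v1beta1 NodePool: with the Replace
        # upgrade type, rollingUpdate.maxUnavailable/maxSurge control how many
        # nodes in this pool are replaced at once during an update.
        apiVersion: hypershift.openshift.io/v1beta1
        kind: NodePool
        metadata:
          name: example-pool
          namespace: clusters
        spec:
          clusterName: example-cluster
          replicas: 3
          management:
            upgradeType: Replace
            replace:
              strategy: RollingUpdate
              rollingUpdate:
                maxUnavailable: 1
                maxSurge: 1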

      The workloads in the cluster are such that there is no guarantee regarding PDBs. That is, a) workloads may not have PDBs at all, b) workloads may have PDBs that do not allow any disruptions, and c) workloads may have PDBs that allow disruptions. The administrator who performs the cluster update will not know beforehand which node pools or workloads fall into any of the above three patterns.
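      As an illustration of pattern b), a PDB like the hypothetical one below permits zero voluntary disruptions, so a drain of any node hosting a matching pod can never complete:

        # Hypothetical PDB: with maxUnavailable: 0 the Eviction API refuses every
        # eviction of a matching pod, so node drain blocks indefinitely.
        apiVersion: policy/v1
        kind: PodDisruptionBudget
        metadata:
          name: no-disruption-allowed
          namespace: example-app
        spec:
          maxUnavailable: 0
          selector:
            matchLabels:
              app: example-app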

      When MaxUnavailable is chosen such that the selected nodes cannot drain, the cluster update does not progress for hours, if not days. Since clusters can have as many as 500 nodes, a cluster update can take days. It takes trial and error to pick the right MaxUnavailable for every cluster update or day-2 configuration task, such as changing a KubeletConfig.
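      In practice that trial and error means adjusting maxUnavailable on the affected pool between attempts, for example with a merge patch like the sketch below (the 10% value is illustrative), applied via oc patch machineconfigpool worker --type merge:

        # Illustrative merge patch for the worker MachineConfigPool;
        # "10%" is an example value, not a recommendation.
        spec:
          maxUnavailable: "10%"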

      4. List any affected packages or components.

      HyperShift Operator, Machine Config Controller

       

      [1] https://docs.openshift.com/container-platform/4.15/post_installation_configuration/machine-configuration-tasks.html#machine-config-overview-post-install-machine-configuration-tasks

      [2] https://docs.openshift.com/container-platform/4.15/updating/understanding_updates/how-updates-work.html#mco-update-process_how-updates-work 

      [3] https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-gracefully_nodes-nodes-rebooting

      [4] https://hypershift-docs.netlify.app/how-to/automated-machine-management/configure-machines/ 

            azaalouk Adel Zaalouk
            rh-ee-bchandra Balachandran Chandrasekaran
            Votes: 0
            Watchers: 4