[RFE-6413] Provision spare capacity (workers) before a MachineConfigPool upgrade starts, to relocate workload quickly

Type: Feature Request
Resolution: Won't Do
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Cluster Infrastructure
Labels:

Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected

1. Proposed title of this feature request
Provision spare capacity (workers) before a MachineConfigPool upgrade starts, to relocate workload quickly

2. What is the nature and description of the request?
During OpenShift Container Platform 4 updates or MachineConfig updates, all nodes of the affected MachineConfigPool are restarted. The maxUnavailable setting controls how many machines from this pool can be updated in parallel, which speeds up the process. As a result, OpenShift Container Platform 4 also knows in advance how many workers are about to go offline and that additional capacity may be required. While it is possible to scale workers automatically based on Pending pods, it would be preferable if, during maintenance tasks, OpenShift Container Platform 4 could scale up additional workers based on the number configured in maxUnavailable within the MachineConfigPool, so that pods do not get stuck in Pending state waiting for new workers to come up. That way, workload relocation happens faster, and no reactive scale-up is needed during potential load peaks while updates are running. This lowers workload disruption (for non-cloud-native workloads) and improves the overall experience.

Because this is only possible where Machine scaling is available, the functionality should be optional: not everyone will want it, or have the capacity for it.
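To illustrate the sizing logic being requested, here is a minimal Python sketch of how the number of spare workers could be derived from maxUnavailable. It assumes maxUnavailable is either an absolute count or a percentage string such as "25%"; the function name `spare_workers_needed`, the `round_up` flag, and the default of rounding percentages down are assumptions made for this example, not the behaviour of the Machine Config Operator.

```python
def spare_workers_needed(max_unavailable, pool_size, round_up=False):
    """Return how many spare workers to provision before a pool update.

    max_unavailable: int (absolute count) or str such as "25%".
    pool_size: current number of machines in the MachineConfigPool.
    round_up: hypothetical switch; how percentages are rounded is an
              assumption of this sketch.
    """
    if pool_size < 0:
        raise ValueError("pool_size must not be negative")

    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        scaled = pool_size * percent
        # Integer arithmetic avoids floating-point rounding surprises.
        count = -(-scaled // 100) if round_up else scaled // 100
    else:
        count = int(max_unavailable)

    # Never request a negative number or more spares than the pool has machines.
    return max(0, min(count, pool_size))


if __name__ == "__main__":
    print(spare_workers_needed(3, 10))
    print(spare_workers_needed("25%", 10))
    print(spare_workers_needed("25%", 10, round_up=True))
```

Capping the result at the pool size is a design choice of the sketch: it keeps a misconfigured maxUnavailable from requesting more temporary machines than could ever be drained at once.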

3. Why does the customer need this? (List the business requirements here)
During updates, the goal is to process worker updates as quickly as possible, and therefore to set maxUnavailable to a high value. This in turn can leave many pods stuck in Pending state until additional workers are provisioned. Given that OpenShift Container Platform 4 knows how many workers are going to be offline before the update starts, it would be helpful to scale up by maxUnavailable additional workers in advance. This allows a smooth transition of the workload being evicted and prevents the long wait that occurs when additional capacity has to be provisioned on demand. This is especially painful for non-cloud-native workloads, and it could be solved fairly easily with some logic, given that all the data is already available and only the scaling would need to happen. At the end of the activity, the number of workers should be brought back down to the original number, or the Cluster Autoscaler logic should be triggered to reduce the number of workers again after the update has completed.
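The requested lifecycle (scale up first, run the update, then return to the original worker count) can be sketched as a Python context manager. The callbacks `get_replicas` and `set_replicas` are hypothetical stand-ins for whatever mechanism reads and changes the worker count; they are assumptions for illustration and do not correspond to an existing OpenShift API.

```python
from contextlib import contextmanager


@contextmanager
def temporary_capacity(get_replicas, set_replicas, extra):
    """Add `extra` workers for the duration of the block, then restore.

    get_replicas / set_replicas are hypothetical callbacks supplied by
    the caller; this sketch does not talk to a real cluster.
    """
    original = get_replicas()
    set_replicas(original + extra)
    try:
        # The pool update would run while the extra capacity exists.
        yield original + extra
    finally:
        # Restore the original count even if the update fails.
        set_replicas(original)


if __name__ == "__main__":
    state = {"replicas": 6}
    with temporary_capacity(lambda: state["replicas"],
                            lambda n: state.update(replicas=n),
                            extra=2) as scaled:
        print("replicas during update:", scaled)
    print("replicas after update:", state["replicas"])
```

Restoring the count in a `finally` clause reflects the request that the worker count return to its original value once the activity ends; handing the scale-down to the Cluster Autoscaler instead would be an equally valid reading of the request.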

4. List any affected packages or components.
Machine Config Operator

There are no comments yet on this issue.

Assignee: Subin M

Reporter: Simon Reber

Votes: 0

Watchers: 2

Created: 2024/10/14 3:44 PM

Updated: 2025/03/07 12:09 AM

Resolved: 2024/11/14 4:31 PM
