Feature Request
Resolution: Unresolved
Normal
all
Product / Portfolio Work
1. Proposed title of this feature request
During an OCP upgrade, add the ability to:
- display the next worker node that will be picked for upgrade
- add a manually controlled delay between the upgrades of two consecutive worker nodes
2. What is the nature and description of the request?
Currently, worker nodes are picked in a seemingly random order while an OCP upgrade is in progress.
We have a limited number of GPU nodes and little to no buffer capacity. Because we have no visibility into which worker node will be picked next, it is difficult for us to arrange buffer capacity in time for the GPU profile of that node.
When no buffer capacity is available, the next GPU node that is picked fails to drain its workloads. The OCP upgrade then gets stuck and our clients experience an outage, which requires manual intervention and causes a significant drop in the availability of the platform for AI/ML workloads.
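To illustrate the visibility gap, here is a minimal sketch of what can be derived today. It assumes each node carries the Machine Config Operator annotation `machineconfiguration.openshift.io/currentConfig` and that the name of the target rendered config is already known (for example, read from the worker MachineConfigPool). The function name, node names, and config names are hypothetical. It only lists the nodes that are still pending; it cannot say which of them will be picked next, which is the part this request asks for.

```python
# Sketch only: lists worker nodes that have not yet reached the target config.
# Input is a list of node objects shaped like the items of `oc get nodes -o json`.
CURRENT_CONFIG = "machineconfiguration.openshift.io/currentConfig"


def pending_nodes(nodes, target_config):
    """Return the names of nodes whose current config differs from target_config."""
    pending = []
    for node in nodes:
        meta = node.get("metadata", {})
        annotations = meta.get("annotations", {})
        if annotations.get(CURRENT_CONFIG) != target_config:
            pending.append(meta.get("name"))
    return pending


# Hypothetical sample data; node and config names are made up for illustration.
sample_nodes = [
    {"metadata": {"name": "gpu-a100-1",
                  "annotations": {CURRENT_CONFIG: "rendered-worker-new"}}},
    {"metadata": {"name": "gpu-a100-2",
                  "annotations": {CURRENT_CONFIG: "rendered-worker-old"}}},
    {"metadata": {"name": "gpu-h100-1",
                  "annotations": {CURRENT_CONFIG: "rendered-worker-old"}}},
]

print(pending_nodes(sample_nodes, "rendered-worker-new"))
# prints ['gpu-a100-2', 'gpu-h100-1']
```

With two different GPU profiles still pending, we would need buffer capacity for both, because we cannot tell which one comes next.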
3. Why does the customer need this? (List the business requirements here)
Visibility into the next worker node to be picked for the OCP upgrade will help us prepare buffer capacity in advance for the GPU profile of that node.
A manually controlled delay between two consecutively picked GPU nodes will give us sufficient time to arrange a node with the needed GPU profile. This ensures that the OCP upgrade does not get stuck for lack of buffer capacity, avoids outages for clients, and lets us maintain our SLA.
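The requested behavior could look roughly like the following sketch. It is not an existing OpenShift API: `gated_rollout`, `approve`, and `upgrade` are hypothetical names, and the node names are made up. The point is only the order of steps: announce the next node, wait for a manual decision, then upgrade.

```python
# Sketch only: models the requested "show next node, then wait for approval" flow.
def gated_rollout(nodes, approve, upgrade):
    """Upgrade nodes one at a time, announcing each and waiting for approval.

    approve(node) stands in for the manual intervention step and returns
    True once buffer capacity for that node's GPU profile is in place.
    """
    upgraded = []
    for node in nodes:
        print(f"Next node to upgrade: {node}")
        if not approve(node):
            print(f"Paused before {node}: waiting for manual approval")
            break
        upgrade(node)
        upgraded.append(node)
    return upgraded


# Hypothetical usage: buffer capacity exists only for the A100 profile so far.
buffer_ready = {"gpu-a100-1", "gpu-a100-2"}
result = gated_rollout(
    ["gpu-a100-1", "gpu-a100-2", "gpu-h100-1"],
    approve=lambda node: node in buffer_ready,
    upgrade=lambda node: print(f"Upgrading {node}"),
)
print(result)
```

In this sketch the rollout stops before `gpu-h100-1` instead of draining a node whose workloads have nowhere to go, which is the outage scenario we want to avoid.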
4. List any affected packages or components.
Not sure.