Feature Request
Resolution: Won't Do
Major
Product / Portfolio Work
Red Hat OpenShift Container Platform
Proposed title of this feature request
Support for Node-Specific Pod Scheduling by Manual Assignment
What is the nature and description of the request?
To facilitate the virtual batch-based upgrade solution, this feature is highly desirable: it would allow Telco CNF workloads to be scheduled onto designated nodes by manual assignment before the batch-based upgrade is performed. This optimization reduces the number of virtual groups (aka batches), thereby achieving the shortest possible parallel upgrade maintenance window.
Why does the customer need this? (List the business requirements here)
Node-specific pod scheduling is an efficient enabler for batch-based upgrades in ISSU (In-Service Software Upgrade) solutions.
According to the customer's estimate, on their current largest cluster of 472 worker nodes, pods are scheduled essentially at random, so the initial batch count may be around 20 or more. With this feature, the customer would like to reduce that to 10-12 batches of 40+ nodes each. Since each node takes about 30 minutes to finish upgrading, 6-8 batches (roughly 3-4 hours of upgrade time) could be performed in one maintenance window; nodes in the same batch can be upgraded together without worrying about business disruption. In this way, the customer can finish upgrading this super-large cluster in two maintenance windows, which greatly reduces the cost of the upgrade.
Currently, without this feature, when the customer wants to schedule a pod onto a specific node to minimize the number of batches, they can only do so indirectly: they temporarily disable scheduling on all unrelated nodes and delete the pod repeatedly, hoping that the pod eventually lands on the target node by chance. This requires executing scheduling-prohibition commands on hundreds of nodes at once, which poses significant risk in commercial environments and is extremely inconvenient and inefficient.
The customer does not want to use hard scheduling policies such as nodeSelector, as that would reduce pod flexibility. In other words, the pod spec will not include a specific target node; it will only use relatively soft affinity/anti-affinity rules to ensure that the pods of a StatefulSet are not scheduled on the same node for high availability, but without hard-coding a particular node (or a node with a specific label).
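For illustration only, the kind of soft rule described here would be a preferred (not required) pod anti-affinity in the StatefulSet's pod template, roughly like the sketch below; the label key and value are placeholders, not taken from the customer's environment.

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: example-app   # placeholder label selecting the StatefulSet's pods
          topologyKey: kubernetes.io/hostname

Because the rule is only preferred, the scheduler tries to spread the replicas across different nodes, but nothing in the spec names a particular node or node label, which matches the requirement above.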
This feature is intended solely as a temporary measure during upgrades, not a long-term configuration. They do not want any environment-specific labels or annotations added to the pod spec. Likewise, they will not add special labels or annotations to nodes during upgrades.
This issue arises in the context of the "big rock, small rock" problem — when pods of different resource sizes are deployed together.
For example, if both a large pod (8 cores / 16 GiB memory, a "big rock") and a smaller pod (2 cores / 4 GiB memory, a "small rock") are scheduled, and the small pod is deployed first, it may occupy enough resources on a node to prevent the large pod from being placed there due to insufficient remaining capacity (on a node with, say, 9 allocatable cores, the 2-core pod leaves only 7 cores free, which is not enough for the 8-core pod).
In such cases, it becomes necessary to schedule the small pod to another node first, allowing the large pod to be scheduled and placed successfully. After the large pod is deployed, the small pod can be rescheduled back, resulting in an optimized placement strategy that makes better use of available node resources.
Ideal solution
During the upgrade process, the customer should be able to select a specific pod in the web console or via CLI, and then assign it directly to a specific node — effectively forcing the pod to migrate there.
This behavior would override the current scheduler’s default policies, and the manually specified scheduling would take precedence. It functions similarly to the VM live migration feature in OpenStack, where an administrator can manually designate the destination node for a virtual machine.
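For reference, one low-level mechanism that already exists in the Kubernetes API and that such a capability could conceivably build on is the pods/binding subresource, which the default scheduler itself uses to assign a Pending pod to a node. The sketch below only illustrates that mechanism and is not the requested feature: it assumes the pod has first been deleted, that its controller has recreated it, and that the replacement is still Pending; pod-1-b, worker6, and example-ns are placeholder names taken from or added to the example later in this request.

  # Run a local proxy to the cluster API server (listens on 127.0.0.1:8001 by default).
  oc proxy &

  # Bind the still-Pending pod to the chosen node via its binding subresource.
  curl -X POST -H "Content-Type: application/json" \
    --data '{"apiVersion":"v1","kind":"Binding","metadata":{"name":"pod-1-b"},"target":{"apiVersion":"v1","kind":"Node","name":"worker6"}}' \
    http://127.0.0.1:8001/api/v1/namespaces/example-ns/pods/pod-1-b/binding

Unlike OpenStack live migration, this only works on a pod that has not yet been scheduled, which is part of why a first-class, supported console/CLI capability is being requested.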
An example of how the customer handles this today:
Assume the current state (where, for each application, the -a and -b pods are mutually exclusive and must not be placed in the same batch):
- Batch 1 (worker1–3): pod-1-a, pod-2-a, pod-4-b
- Batch 2 (worker4): pod-1-b, pod-2-b, pod-3-a → to be adjusted
- Batch 3 (worker5–8): pod-3-b, pod-4-a
The goal now is to move the pods on Batch 2 (worker4) that conflict with Batch 1 into Batch 3 (worker5–8), and merge the non-conflicting ones into Batch 1 (worker1–3).
To achieve the target state:
- Adjust pod-1-b first. Since pod-1-b conflicts with pod-1-a, it cannot join Batch 1. There are two ways to adjust:
- Option 1: disable scheduling on worker4 and worker1-3, and enable it on worker5-8. This ensures that deleting pod-1-b once will place it onto one of worker5-8. The operation has to be executed on many nodes, and disabling scheduling increases risk in commercial environments; on the largest 472-worker-node cluster this approach is very unfriendly and a disaster for on-site personnel. (A command-level sketch of this sequence is shown after this example.)
- Option 2: provide a function in the OCP web console or CLI to move pod-1-b directly to the target node. This operation only needs to be done once and is very convenient.
- pod-2-b follows the same procedure as pod-1-b.
- For pod-3-a, since it conflicts with Batch 3 but not with Batch 1, it can remain on worker4; Batch 1 and Batch 2 can then be merged to reduce the number of batches.
After the adjustment:
- Batch 1 (worker1–4): pod-1-a, pod-2-a, pod-3-a, pod-4-b
- Batch 2 (worker5–8): pod-4-a, pod-1-b, pod-2-b, pod-3-b
In this way, worker1–4 can be merged as Batch 1, reducing the number of batches by one.
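As referenced in Option 1 above, the command-level sequence for moving pod-1-b would look roughly like the following sketch; it assumes the namespace is example-ns (a placeholder) and that pod-1-b's StatefulSet recreates the pod after deletion.

  # Disable scheduling on worker1-4 so the replacement pod can only land on worker5-8.
  for node in worker1 worker2 worker3 worker4; do
    oc adm cordon "$node"
  done

  # Delete pod-1-b; its StatefulSet recreates it on one of the schedulable nodes.
  oc delete pod pod-1-b -n example-ns

  # Check where the replacement landed, then re-enable scheduling.
  oc get pod pod-1-b -n example-ns -o wide
  for node in worker1 worker2 worker3 worker4; do
    oc adm uncordon "$node"
  done

At the scale of the 472-worker-node cluster described above, the cordon/uncordon loops would have to cover hundreds of nodes, which is exactly the operational burden this feature request aims to remove.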
List any affected packages or components.
Scheduler. The feature would require enhancements to the scheduler to support pod scheduling to specific nodes.