Feature
Resolution: Unresolved
Major
Product / Portfolio Work
Feature Overview (aka. Goal Summary)
Currently, the default scheduler in OpenShift handles jobs sequentially as they arrive. This is suitable for many applications but not for certain AI/ML workloads, which often consist of multiple interdependent jobs that must run simultaneously to operate correctly (an "all-or-nothing" requirement). If these jobs cannot all be scheduled together, the workload fails to function as intended.
The proposed Gang Scheduler will enhance OpenShift's scheduling capabilities by treating a group of jobs (a "gang") as a single scheduling unit. The scheduler will ensure that all jobs within a defined gang are scheduled together. If resources are not currently available to accommodate every job in the gang, the scheduler will hold the whole gang back until sufficient resources are available. This all-at-once strategy avoids partial resource allocation, so tightly coordinated AI/ML workloads either start in full or do not start at all.
Example Scenario
- For an AI/ML pipeline with multiple interdependent jobs, the Gang Scheduler would assess resource availability for the entire group.
- If resources to accommodate the gang are insufficient, the scheduler will not partially schedule the jobs. Instead, it will wait until the full resource set is available, enabling all jobs to start together as required.
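The all-or-nothing behavior in this scenario can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the proposed implementation: the function name `schedule_gang`, the node and job names, and the single CPU dimension are all hypothetical, and placement uses a simple first-fit heuristic (largest job first) that may reject a gang a smarter packer could fit.

```python
from typing import Dict, Optional


def schedule_gang(free_cpu: Dict[str, int],
                  gang: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Place every job in the gang, or place none of them.

    free_cpu maps node name -> free CPU units (hypothetical model).
    gang maps job name -> requested CPU units.
    Returns a job -> node mapping, or None if the whole gang does not fit.
    free_cpu is modified only when the entire gang is placed.
    """
    # Work on a copy so a failed attempt leaves no partial allocation behind.
    remaining = dict(free_cpu)
    placement: Dict[str, str] = {}

    # First-fit, largest request first (a heuristic, not an optimal packer).
    for job, request in sorted(gang.items(), key=lambda kv: kv[1], reverse=True):
        node = next((n for n, free in remaining.items() if free >= request), None)
        if node is None:
            return None  # one job does not fit, so the whole gang waits
        remaining[node] -= request
        placement[job] = node

    # Commit the allocation only after every job has found a node.
    free_cpu.update(remaining)
    return placement


nodes = {"node-a": 4, "node-b": 2}
gang = {"trainer-0": 3, "trainer-1": 3}

first_try = schedule_gang(nodes, gang)   # None: only one job would fit
nodes_after_first_try = dict(nodes)      # still {"node-a": 4, "node-b": 2}

nodes["node-b"] = 4                      # capacity frees up on node-b
second_try = schedule_gang(nodes, gang)  # both jobs are placed together
```

On the first attempt only one of the two jobs would fit, so nothing is allocated and the node capacities stay unchanged. Once enough capacity is free, both jobs are placed in a single step.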
This feature will provide critical support for resource-intensive, tightly coupled workloads, enhancing OpenShift's capabilities for AI/ML applications and other workloads that need all of their jobs to start together.