Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: AI, ai-ml-workloads
Labels:

Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-1692 AI Workloads for OpenShift
Status Summary:

Status: Green
Feature team is on RIT this sprint. No new updates.

Blocked:
False
Blocked Reason:

None

Ready:
False
Size:
None

Target Version:
None
Release Blocker:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Priority Data:
None
PX Impact Score:
PX Technical Impact:
None
PX Impact Range:
None
PX Scheduling Request:
None
PX Technical Impact Notes:
None

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

Currently, the default scheduler in OpenShift schedules jobs one at a time as they arrive, which suits many applications but not certain AI/ML workloads. These workloads often consist of multiple interdependent jobs that must run simultaneously to operate correctly (an "all-or-nothing" requirement). If the jobs cannot all be scheduled together, the workload fails to function as intended.

The proposed Gang Scheduler will enhance OpenShift's scheduling capabilities by treating a group of jobs (a "gang") as a single scheduling unit. The scheduler will ensure that all jobs in a defined gang are scheduled together. If resources cannot currently accommodate every job in the gang, the scheduler will hold the entire gang until sufficient resources are available. This all-at-once strategy lets AI/ML workloads run without partial resource allocation, which tightly coordinated workloads require.
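The all-or-nothing placement described above can be illustrated with a minimal Python sketch. It is not the proposed scheduler's implementation: `Node`, `Job`, `place_gang`, and the single GPU resource dimension are hypothetical names chosen for illustration, and the first-fit heuristic is an assumption rather than anything this feature specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    free_gpus: int  # hypothetical: one resource dimension only


@dataclass
class Job:
    name: str
    gpus: int


def place_gang(gang: List[Job], nodes: List[Node]) -> Optional[Dict[str, str]]:
    """Return a job -> node assignment for the whole gang, or None.

    Placement is tried on a scratch copy of free capacity, so node state
    changes only if every job in the gang fits.
    """
    free = {n.name: n.free_gpus for n in nodes}  # scratch copy
    assignment: Dict[str, str] = {}
    # Largest jobs first: a simple heuristic that can miss a feasible packing.
    for job in sorted(gang, key=lambda j: j.gpus, reverse=True):
        target = next((name for name, cap in free.items() if cap >= job.gpus), None)
        if target is None:
            return None  # all-or-nothing: reject the whole gang
        free[target] -= job.gpus
        assignment[job.name] = target
    # Commit only after every job has been given a node.
    for n in nodes:
        n.free_gpus = free[n.name]
    return assignment


nodes = [Node("a", 4), Node("b", 2)]
gang = [Job("trainer-0", 2), Job("trainer-1", 2), Job("trainer-2", 2)]
print(place_gang(gang, nodes))
```

With 6 free GPUs the three 2-GPU jobs are all placed. Adding a fourth 2-GPU job to the gang would make `place_gang` return `None` and leave the nodes untouched, which is the behavior a sequential scheduler does not provide.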

Example Scenario

For an AI/ML pipeline with multiple interdependent jobs, the Gang Scheduler would assess resource availability for the entire group.
If resources to accommodate the gang are insufficient, the scheduler will not partially schedule the jobs. Instead, it will wait until the full resource set is available, enabling all jobs to start together as required.
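The waiting behavior in this scenario can be sketched as a pending queue that is re-checked whenever capacity changes. This is a simplified illustration under stated assumptions: `Gang` and `admit_pending` are hypothetical names, capacity is modeled as one cluster-wide GPU count (per-node fit is ignored), and strict FIFO ordering is an assumption, not a stated requirement of the feature.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Gang:
    name: str
    job_gpus: List[int]  # GPUs requested by each job in the gang

    @property
    def total_gpus(self) -> int:
        return sum(self.job_gpus)


def admit_pending(pending: Deque[Gang], free_gpus: int) -> Tuple[List[str], int]:
    """Admit gangs from the head of the queue while the whole gang fits.

    Stops at the first gang that does not fit (strict FIFO), so a waiting
    gang is never partially started. Returns the admitted gang names and
    the GPUs still free afterwards.
    """
    admitted: List[str] = []
    while pending and pending[0].total_gpus <= free_gpus:
        gang = pending.popleft()
        free_gpus -= gang.total_gpus
        admitted.append(gang.name)
    return admitted, free_gpus


pending = deque([Gang("pipeline", [4, 4, 2])])  # needs 10 GPUs in total
print(admit_pending(pending, 8))   # only 8 GPUs free: the gang keeps waiting
print(admit_pending(pending, 10))  # capacity freed up: the whole gang starts
```

The first call admits nothing even though two of the three jobs would fit on their own. The gang starts only once all 10 GPUs are available.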

This feature will provide critical support for resource-intensive, tightly coupled workloads, strengthening OpenShift for AI/ML applications and other workloads that need all-or-nothing scheduling.

is depended on by:
RFE-8495 Support Topology-Aware Gang Scheduling (Backlog)

is related to:
RFE-8233 Enable KAI Schedule on RHOAI (Closed)

links to:
[Upstream gDoc] Gang Scheduling Support In Kubernetes
[Upstream] Gang Scheduling Support in Kubernetes

Assignee: Gaurav Singh
Reporter: Gaurav Singh
Need Info From: None
Contributors: Ju Lim, Kevin Hannon, Mrunal Patel
Architect: Mrunal Patel
QA Contact: Rahul Gangwar
Doc Contact: Matthew Werner
Product Operations Engineering Contact: Eric Rich
Votes: 0
Watchers: 16
Created: 2024/11/12 1:23 PM
Updated: 2025/11/20 4:07 AM
