• Product / Portfolio Work
    • OCPSTRAT-1692: AI Workloads for OpenShift
      Status: Green
      The feature team's upstream work to get the proposal accepted is making good progress, and the team will begin a PoC for the parts that have been accepted.

      Feature Overview (aka. Goal Summary)  

      Currently, the default scheduler in OpenShift handles jobs sequentially as they arrive, which is suitable for many applications but not for certain AI/ML workloads. These workloads often consist of multiple interdependent jobs that must run simultaneously to operate correctly (i.e., an "all-or-nothing" requirement). If these jobs cannot all be scheduled together, the workload fails to function as intended.
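
      The partial-placement problem described above can be illustrated with a toy model. This is a minimal sketch, not OpenShift or Kubernetes scheduler code: the function name `schedule_one_at_a_time`, the node names, and the CPU counts are hypothetical, and placement is a plain first-fit.

```python
def schedule_one_at_a_time(free_cpus, requests):
    """Place each request on the first node with room, independently of the others.

    free_cpus: dict of node name -> free CPUs (updated in place).
    requests:  list of (job name, CPUs needed) tuples.
    Returns (placed, pending), where placed maps job name -> node name.
    """
    placed, pending = {}, []
    for job, cpus in requests:
        for node, free in free_cpus.items():
            if free >= cpus:
                free_cpus[node] = free - cpus  # reserve capacity right away
                placed[job] = node
                break
        else:
            pending.append(job)  # no node had room for this job
    return placed, pending


# Hypothetical cluster: two nodes with 4 free CPUs each.
nodes = {"node-a": 4, "node-b": 4}
# Three interdependent workers that only make progress when all are running.
workers = [("worker-0", 3), ("worker-1", 3), ("worker-2", 3)]

placed, pending = schedule_one_at_a_time(nodes, workers)
print("placed:", placed)
print("pending:", pending)
```

      In this toy run, two workers hold capacity while the third stays pending, so the workload as a whole cannot make progress even though most of its resources are already reserved.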

      The proposed Gang Scheduler will enhance OpenShift's scheduling capabilities by recognizing and handling groups of jobs (or "gangs") as a unified scheduling entity. This scheduler will ensure that all jobs within a defined gang are scheduled together. If resources are not currently available to accommodate all the jobs in the gang, the scheduler will delay the gang until sufficient resources are available. This all-at-once scheduling strategy will allow AI/ML workloads to run as needed without partial resource allocation, supporting high coordination requirements essential to complex workloads.

      Example Scenario

      • For an AI/ML pipeline with multiple interdependent jobs, the Gang Scheduler would assess resource availability for the entire group.
      • If resources to accommodate the gang are insufficient, the scheduler will not partially schedule the jobs. Instead, it will wait until the full resource set is available, enabling all jobs to start together as required.
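
      The scenario above can be sketched as an all-or-nothing admission check. This is an illustrative model under stated assumptions, not the proposed scheduler's implementation: `try_schedule_gang`, the node names, and the CPU counts are hypothetical, and the first-fit trial can reject a gang that a smarter packing would fit.

```python
def try_schedule_gang(free_cpus, gang):
    """All-or-nothing placement: return job -> node for the whole gang, or None.

    The trial runs on a copy of free_cpus; capacity is committed only if
    every job in the gang fits.
    """
    trial = dict(free_cpus)
    placement = {}
    for job, cpus in gang:
        # First node in the trial view with enough free CPUs, if any.
        node = next((n for n, free in trial.items() if free >= cpus), None)
        if node is None:
            return None  # one job does not fit, so nothing is placed
        trial[node] -= cpus
        placement[job] = node
    free_cpus.update(trial)  # commit the reservation for the full gang
    return placement


# Hypothetical cluster and gang, chosen so the gang does not fit at first.
nodes = {"node-a": 4, "node-b": 4}
gang = [("worker-0", 3), ("worker-1", 3), ("worker-2", 3)]

first_attempt = try_schedule_gang(nodes, gang)   # gang is delayed, nothing reserved
nodes["node-c"] = 4                              # more capacity becomes available
second_attempt = try_schedule_gang(nodes, gang)  # whole gang is placed together
print(first_attempt, second_attempt)
```

      Running the trial on a copy keeps a rejected gang from holding any capacity while it waits, which is the difference from placing each job independently.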

      This feature will provide critical support for resource-intensive, tightly coupled workloads, enhancing OpenShift's capabilities for AI/ML applications and other workloads that rely on gang scheduling.

              gausingh@redhat.com Gaurav Singh
              gausingh@redhat.com Gaurav Singh
              None
              Ju Lim, Kevin Hannon, Mrunal Patel
              Mrunal Patel
              Rahul Gangwar
              Matthew Werner
              Eric Rich