Red Hat Advanced Cluster Management / ACM-10323

Investigate OCM (Placement) + AI workload


      Value Statement

      Google announced Dynamic Workload Scheduler, which manages resource access for AI/ML workloads: https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler

      Dynamic Workload Scheduler can be used through orchestrators such as Kueue. Popular ML frameworks such as Ray, Kubeflow, Flux, PyTorch, and other training operators are supported out of the box.

      Kueue (https://github.com/kubernetes-sigs/kueue) is a job queueing project. Its Provisioning Admission Check controller can integrate with cluster autoscalers, and its MultiKueue Admission Check controller supports multi-cluster job dispatching.

      In this spike, we want to use Kueue as an example to investigate popular AI workload dispatching projects, their multi-cluster use cases, and how OCM can participate.
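
      As one possible starting point for the investigation, the sketch below shows how clusters selected by an OCM Placement might be fed into Kueue's MultiKueue setup. It is not a design decision or an existing integration. The helper names (placement_decision_clusters, multikueue_config, multikueue_admission_check) and the object names used in the usage note are hypothetical, and the sketch assumes a MultiKueueCluster object already exists for each OCM managed cluster under the same name. Field names follow the Kueue and OCM APIs as commonly documented and should be checked against the versions in use.

```python
# Illustrative sketch only: plain dicts standing in for Kubernetes manifests.
# Nothing here talks to a cluster; it only shows how the objects could relate.

# Assumption: recent Kueue releases serve MultiKueueConfig under v1beta1;
# older releases used v1alpha1. Adjust to the installed version.
KUEUE_API_VERSION = "kueue.x-k8s.io/v1beta1"


def placement_decision_clusters(placement_decision: dict) -> list:
    """Return the cluster names listed in an OCM PlacementDecision status."""
    decisions = placement_decision.get("status", {}).get("decisions", [])
    return [d["clusterName"] for d in decisions]


def multikueue_config(name: str, clusters: list) -> dict:
    """Build a MultiKueueConfig that names the worker clusters to dispatch to.

    Assumption (hypothetical convention): each entry refers to a
    MultiKueueCluster object named after the OCM managed cluster.
    """
    return {
        "apiVersion": KUEUE_API_VERSION,
        "kind": "MultiKueueConfig",
        "metadata": {"name": name},
        "spec": {"clusters": list(clusters)},
    }


def multikueue_admission_check(name: str, config_name: str) -> dict:
    """Build an AdmissionCheck handled by the MultiKueue controller."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "AdmissionCheck",
        "metadata": {"name": name},
        "spec": {
            "controllerName": "kueue.x-k8s.io/multikueue",
            "parameters": {
                "apiGroup": "kueue.x-k8s.io",
                "kind": "MultiKueueConfig",
                "name": config_name,
            },
        },
    }
```

      For example, given a PlacementDecision whose status lists cluster1 and cluster2, multikueue_config("ocm-selected", placement_decision_clusters(decision)) would produce a MultiKueueConfig whose spec.clusters contains those two names, and multikueue_admission_check("ocm-multikueue", "ocm-selected") would reference it. Whether a controller should keep these objects in sync is one of the open questions for the spike.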

      Definition of Done for Engineering Story Owner (Checklist)

      • ...

      Development Complete

      • The code is complete.
      • Functionality is working.
      • Any required downstream Dockerfile changes are made.

      Tests Automated

      • [ ] Unit/function tests have been automated and incorporated into the
        build.
      • [ ] 100% automated unit/function test coverage for new or changed APIs.

      Secure Design

      • [ ] Security has been assessed and incorporated into your threat model.

      Multidisciplinary Teams Readiness

      Support Readiness

      • [ ] The must-gather script has been updated.

            Qing Hao (qhao@redhat.com)
            Hui Chen