-
Spike
-
Resolution: Done
-
Normal
-
None
-
ACM 2.10.0
Value Statement
Google announced Dynamic Workload Scheduler, which manages access to resources for AI/ML workloads: https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler
Dynamic Workload Scheduler is used through orchestrators such as Kueue. Popular ML frameworks such as Ray, Kubeflow, Flux, and PyTorch, along with other training operators, are supported out of the box.
Kueue (https://github.com/kubernetes-sigs/kueue) is a job queueing project. Its Provisioning Admission Check controller can integrate with cluster autoscalers, and its MultiKueue Admission Check controller supports multi-cluster job dispatching.
In this Spike we use Kueue as an example to investigate popular AI workload dispatching projects, their multi-cluster use cases, and how OCM can take part.
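To make the idea of admission checks and multi-cluster dispatching concrete, here is a minimal toy model in Python. It is only a sketch and does not use Kueue's or OCM's API: the names `Job`, `Cluster`, `AdmissionCheck`, and `dispatch` are hypothetical, and real MultiKueue dispatching is more involved than this first-fit loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Job:
    name: str
    gpus: int  # hypothetical resource request


@dataclass
class Cluster:
    name: str
    free_gpus: int
    running: List[str] = field(default_factory=list)


# Assumption: an admission check is modeled as a predicate that must
# pass before a queued job may be admitted.
AdmissionCheck = Callable[[Job], bool]


def dispatch(job: Job, clusters: List[Cluster],
             checks: List[AdmissionCheck]) -> Optional[str]:
    """Return the name of the cluster the job was placed on, or None."""
    # The job stays queued unless every admission check passes.
    if not all(check(job) for check in checks):
        return None
    # First-fit placement: pick the first cluster with enough free GPUs.
    for cluster in clusters:
        if cluster.free_gpus >= job.gpus:
            cluster.free_gpus -= job.gpus
            cluster.running.append(job.name)
            return cluster.name
    return None
```

In this sketch a job that fails any check, or that fits on no cluster, simply stays queued (`None`). The spike question of how OCM can join in maps to who supplies the cluster list and who performs the placement step.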
Definition of Done for Engineering Story Owner (Checklist)
- ...
Development Complete
- The code is complete.
- Functionality is working.
- Any required downstream Docker file changes are made.
Tests Automated
- [ ] Unit/function tests have been automated and incorporated into the build.
- [ ] 100% automated unit/function test coverage for new or changed APIs.
Secure Design
- [ ] Security has been assessed and incorporated into your threat model.
Multidisciplinary Teams Readiness
- [ ] Create an informative documentation issue using the [Customer Portal_doc_issue template](https://github.com/stolostron/backlog/issues/new?assignees=&labels=squad%3Adoc&template=doc_issue.md&title=), and ensure doc acceptance criteria are met. Link the development issue to the doc issue.
- [ ] Provide input to the QE team, and ensure QE acceptance criteria (established between story owner and QE focal) are met.
Support Readiness
- [ ] The must-gather script has been updated.