-
Epic
-
Resolution: Unresolved
-
Critical
-
None
Epic Goal
To provide an add-on for RHACM that lets a user enable RHBoK on their hub and managed clusters (it installs the Kueue operator, creates an AdmissionCheckController on the hub, and ensures the required managed add-ons are present and enabled) and that provides a mechanism to convert Placement and Placement features such as AddOnPlacementScore into MultiKueueConfig and MultiKueueCluster resources.
Additionally, the add-on should provide support for both operator-based installs (OCP) and non-operator installs (Kubernetes via Helm).
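The Placement conversion described in the goal can be sketched roughly as follows. This is a minimal illustration, not the add-on's implementation: it assumes the `kueue.x-k8s.io/v1beta1` API version, treats an OCM PlacementDecision as a plain dict, and uses hypothetical naming conventions (`<placement>-<cluster>` for MultiKueueCluster names and a `multikueue-` prefix for kubeconfig Secrets).

```python
from typing import Any

# Assumption: the Kueue API version served by the installed operator.
KUEUE_API = "kueue.x-k8s.io/v1beta1"


def placement_to_multikueue(
    placement_name: str,
    decision: dict[str, Any],
    secret_prefix: str = "multikueue-",  # hypothetical Secret naming convention
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Build one MultiKueueConfig and one MultiKueueCluster per selected cluster."""
    # A PlacementDecision lists the selected clusters under status.decisions.
    selected = [
        d["clusterName"]
        for d in decision.get("status", {}).get("decisions", [])
    ]

    clusters = [
        {
            "apiVersion": KUEUE_API,
            "kind": "MultiKueueCluster",
            # Hypothetical naming scheme: <placement>-<cluster>.
            "metadata": {"name": f"{placement_name}-{name}"},
            "spec": {
                "kubeConfig": {
                    "locationType": "Secret",
                    "location": f"{secret_prefix}{name}",
                }
            },
        }
        for name in selected
    ]

    config = {
        "apiVersion": KUEUE_API,
        "kind": "MultiKueueConfig",
        "metadata": {"name": placement_name},
        # The config references the MultiKueueCluster objects by name.
        "spec": {"clusters": [c["metadata"]["name"] for c in clusters]},
    }
    return config, clusters
```

A controller built along these lines would re-run the conversion whenever the PlacementDecision changes, so that the MultiKueueConfig tracks the clusters currently selected by the Placement.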
Why is this important?
This makes it easier for customers to use Kueue (via MultiKueue) for AI batch processing in a multicluster (i.e. more than one cluster) environment, utilising familiar OCM concepts and APIs. They do not need to learn MultiKueue to use it right away, because Placement is converted to MultiKueue configuration for them. Support for both operator and Helm installs is essential for true multicluster scenarios.
Scenarios
As an owner of AI workloads, I'd like to utilise OCM APIs, such as Placement, to deploy my batch jobs to multiple targeted nodes across many clusters via MultiKueue.
As a multicluster environment administrator, I'd like to give my users the easiest way to deploy batch jobs with Kueue, with minimal installation hassle, allowing them to utilise Placement scores to target specific criteria on cluster nodes such as GPU type, CPU type, etc.
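To illustrate the score-based targeting in the second scenario, the sketch below ranks clusters by a named score and keeps the top N. It is only an approximation of what an OCM Placement prioritizer does, written over plain dicts. It assumes each AddOnPlacementScore object lives in the managed cluster's namespace on the hub and exposes `status.scores` as a list of name/value pairs. The score name `gpuScore` in the usage note is hypothetical.

```python
from typing import Any


def top_clusters_by_score(
    scores: list[dict[str, Any]],
    score_name: str,
    n: int,
) -> list[str]:
    """Return up to n cluster names, highest value of score_name first."""
    ranked: list[tuple[int, str]] = []
    for obj in scores:
        # Assumption: the namespace of the score object is the cluster name.
        cluster = obj["metadata"]["namespace"]
        for item in obj.get("status", {}).get("scores", []):
            if item["name"] == score_name:
                ranked.append((item["value"], cluster))
    # Sort by descending score; break ties by cluster name for stable output.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cluster for _, cluster in ranked[:n]]
```

For example, calling `top_clusters_by_score(scores, "gpuScore", 2)` would select the two clusters reporting the highest `gpuScore`, and only those clusters would then be converted into MultiKueueCluster entries. Clusters that do not report the named score are skipped.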
Acceptance Criteria
With the installation of the add-on, the RHBoK operator is installed and (on the hub) the AdmissionCheckController is deployed and ready to accept Placements.
Dependencies (internal and external)
- RHBoK exists.
Previous Work (Optional):
- OCM upstream: https://github.com/open-cluster-management-io/ocm/tree/main/solutions/kueue-admission-check
Open questions:
- How to best deploy to non-OCP and what that looks like.
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
OCP/Telco Definition of Done
https://docs.google.com/document/d/1TP2Av7zHXz4_fmeX4q9HB0m9cqSZ4F6Jd4AiVoaF_2s/edit#heading=h.gaa58bzbvwde
Epic Template descriptions and documentation.
https://docs.google.com/document/d/14CUCEg6hQ_jpsFzJtWo29GfFVWmun2Uivrxq3_Fkgdg/edit
ACM-wide Product Requirements (Top-level Epics)
https://docs.google.com/document/d/1uIp6nS2QZ766UFuZBaC9USs8dW_I5wVdtYF9sUObYKg/edit
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Doc issue opened with a completed template. Separate doc issue opened for any deprecation, removal, or any current known issue/troubleshooting removal from the doc, if applicable.
- Considerations were made for Extended Update Support (EUS)
- clones ACM-20495 [DP] Create a RHBoK (Kueue Operator) add-on for RHACM (Closed)