
[TP] Create a RHBoK (Kueue Operator) add-on for RHACM

    • Product / Portfolio Work
    • Green
    • In Progress
    • ACM-18472 - ACM Options to support RHBoK for multiKueue workloads using RHACM

      Epic Goal

      To provide an add-on for RHACM that allows a user to enable RHBoK on their hub and managed clusters. The add-on installs the Kueue operator, creates an AdmissionCheckController on the hub, and ensures that the required managed add-ons are present and enabled. It also provides a mechanism to convert Placement and Placement features such as AddOnPlacementScore into MultiKueueConfig and MultiKueueCluster.

      Additionally, the add-on should provide support for both operator-based installs (OCP) and non-operator installs (Kubernetes via Helm).
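
The Placement-to-MultiKueue conversion named in the goal could look roughly like the sketch below. It is not the add-on's implementation: it only assumes that the clusters selected by a Placement are read from a PlacementDecision (`status.decisions[].clusterName`) and that each one becomes a MultiKueueCluster listed in a single MultiKueueConfig. The function name, the config name, and the `<cluster>-kubeconfig` Secret naming are hypothetical.

```python
from typing import Any

KUEUE_API = "kueue.x-k8s.io/v1beta1"


def placement_to_multikueue(
    decision: dict[str, Any], config_name: str
) -> list[dict[str, Any]]:
    """Translate one PlacementDecision into MultiKueue manifests (as dicts)."""
    # Cluster names chosen by the Placement live under status.decisions.
    clusters = [
        d["clusterName"]
        for d in decision.get("status", {}).get("decisions", [])
    ]

    objects: list[dict[str, Any]] = []
    for name in clusters:
        objects.append({
            "apiVersion": KUEUE_API,
            "kind": "MultiKueueCluster",
            "metadata": {"name": name},
            "spec": {
                "kubeConfig": {
                    "locationType": "Secret",
                    # Hypothetical naming convention for the kubeconfig Secret.
                    "location": f"{name}-kubeconfig",
                }
            },
        })

    # One MultiKueueConfig references every selected cluster by name.
    objects.append({
        "apiVersion": KUEUE_API,
        "kind": "MultiKueueConfig",
        "metadata": {"name": config_name},
        "spec": {"clusters": clusters},
    })
    return objects
```

A controller would re-run a conversion like this whenever the PlacementDecision changes, so that the MultiKueueConfig tracks the current set of selected clusters.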

      Why is this important?

      This makes it easier for customers to use Kueue (via MultiKueue) for AI batch processing in a multicluster (i.e. more than one cluster) environment, utilising familiar OCM concepts and APIs. They do not need to learn MultiKueue to use it right away, because Placement is converted to MultiKueue configuration for them. Support for both operator and Helm installs is essential for true multicluster scenarios.

      Scenarios

      As an owner of AI workloads, I'd like to utilise OCM APIs, such as Placement, to deploy my batch jobs to multiple targeted nodes across many clusters via MultiKueue.

      As a multicluster environment administrator, I'd like to provide my users with the easiest way to deploy batch jobs from Kueue with minimal installation hassle, allowing them to utilise Placement scores to target specific criteria on cluster nodes such as GPU type, CPU type, etc.
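
To make the score-based scenario concrete, here is a minimal sketch of a Placement that ranks clusters by an AddOnPlacementScore. It builds the manifest as a plain dict. All argument values in the usage note (the names, namespace, cluster set, score resource, and score name) are hypothetical, and the scores themselves are assumed to be published by some managed add-on.

```python
from typing import Any


def score_based_placement(
    name: str,
    namespace: str,
    cluster_set: str,
    score_resource: str,
    score_name: str,
    count: int,
) -> dict[str, Any]:
    """Build a Placement that picks the `count` best clusters by an add-on score."""
    return {
        "apiVersion": "cluster.open-cluster-management.io/v1beta1",
        "kind": "Placement",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterSets": [cluster_set],
            "numberOfClusters": count,
            "prioritizerPolicy": {
                # "Exact" means only the prioritizers listed here are applied.
                "mode": "Exact",
                "configurations": [{
                    "scoreCoordinate": {
                        "type": "AddOn",
                        "addOn": {
                            "resourceName": score_resource,
                            "scoreName": score_name,
                        },
                    },
                    "weight": 1,
                }],
            },
        },
    }
```

For example, `score_based_placement("gpu-placement", "kueue-system", "global", "resource-usage-score", "gpuClusterAvailable", 1)` would describe a Placement that selects the single cluster with the highest value for that (hypothetical) GPU score.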

      Acceptance Criteria

      With the installation of the add-on, the RHBoK operator is installed and, on the hub, the AdmissionCheckController is deployed and ready to accept Placements.
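
One possible way to turn this criterion into an automated check is sketched below. It is an interpretation, not an agreed definition of "ready": it assumes the hub controller runs as a Deployment that reports the standard `Available` condition, and that the Kueue AdmissionCheck object reports an `Active` condition. Both objects are passed in as plain dicts (for example, parsed from `-o json` output), and the function names are hypothetical.

```python
from typing import Any


def has_true_condition(obj: dict[str, Any], cond_type: str) -> bool:
    """Return True if obj.status.conditions has cond_type with status "True"."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status") == "True"
    # A missing condition is treated as "not ready".
    return False


def hub_ready(controller_deployment: dict[str, Any],
              admission_check: dict[str, Any]) -> bool:
    """Assumed readiness rule: controller is Available and check is Active."""
    return (
        has_true_condition(controller_deployment, "Available")
        and has_true_condition(admission_check, "Active")
    )
```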

      Dependencies (internal and external)

      1. RHBoK exists.

      Previous Work (Optional):

      1. OCM upstream: https://github.com/open-cluster-management-io/ocm/tree/main/solutions/kueue-admission-check

      Open questions:

      1. How best to deploy to non-OCP clusters, and what that looks like.

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
        Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
        Issue>

      OCP/Telco Definition of Done
      https://docs.google.com/document/d/1TP2Av7zHXz4_fmeX4q9HB0m9cqSZ4F6Jd4AiVoaF_2s/edit#heading=h.gaa58bzbvwde
      Epic Template descriptions and documentation.
      https://docs.google.com/document/d/14CUCEg6hQ_jpsFzJtWo29GfFVWmun2Uivrxq3_Fkgdg/edit
      ACM-wide Product Requirements (Top-level Epics)
      https://docs.google.com/document/d/1uIp6nS2QZ766UFuZBaC9USs8dW_I5wVdtYF9sUObYKg/edit


      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Doc issue opened with a completed template. Separate doc issue
        opened for any deprecation, removal, or any current known
        issue/troubleshooting removal from the doc, if applicable.
      • Considerations were made for Extended Update Support (EUS)

              qhao@redhat.com Qing Hao
              asimonel August Simonelli
              Hui Chen Hui Chen