  1. Performance and Scale for AI Platforms
  2. PSAP-1308

[RHOAI Distributed Workloads] Stress test the Kubeflow training operator & Kueue scheduler


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • None
    • None
    • RHOAI
    • None
    • Stress test the Kubeflow & Kueue operators
    • MLOps, RHOAI, Training
    • Not Selected
    • False
    • False
    • None
    • 40% To Do, 0% In Progress, 60% Done

      Epic Goal

      • Design, implement and run a stress test for the Kubeflow training operator
      • Integrate the Kubeflow training operator stress test in a continuous performance testing pipeline for regression analyses
      • Design, implement and run a stress test for the Kueue scheduler
      • Integrate the Kueue scheduler stress test in a continuous performance testing pipeline for regression analyses

      The current focus is on the Kueue scheduler.

      Why is this important?

      • These components are getting integrated into RHOAI.
      • They are critical for the efficiency of the distributed workload components.

      Deadlines / timeframe

      • Kubeflow training operator: final build by the end of June.
      • Kueue: due to be GA for Summit / RHOAI 2.10.

      Previous Work (Optional):

      1. MCAD scheduler scale test
      2. Scale your Batch / AI Workloads beyond the Kubernetes Scheduler

      Discussions

      Deliverables

       

              kpouget2 Kevin Pouget
              kpouget2 Kevin Pouget
