AIPCC-10981

Execute GPUaaS Hands-On Validation Across Candidate Technologies


      Blocked By: Story 10980 – GPUaaS Validation Test Framework Definition

      This story may begin only after Story 10980 is fully completed and approved.

       


      Context

      Over the past weeks we have established a structured foundation for the GPUaaS initiative.

      We already have:

      1. GPUaaS Product Requirements Document (PRD)

      A technology-agnostic requirements definition describing what GPUaaS must support across scheduling, quotas, observability, fairness, reliability, fragmentation handling, and multi-cluster behavior.

      This document defines the what and the why, not the implementation.

      2. GPUaaS Vision & Historical Context

      A clear understanding of the operational problem we are solving, based on real experience managing GPUs at scale.

      We know what good looks like: priority scheduling, opportunistic usage, reclaim policies, visibility, and data-driven control.

      3. Cross-Team Orchestration Strategy (AIPCC)

      A defined roadmap toward shared GPU orchestration across teams, including controller logic, queues, observability dashboards, and multi-cluster intelligence.

      4. Story 10980 – GPUaaS Validation Test Framework

      A formal, structured test plan that translates the PRD into executable validation scenarios.

      This ensures we evaluate technologies based on controlled experiments rather than opinion.


      Why This Story Is Critical

      At this stage, we are moving from theory to evidence.

      We are evaluating multiple technologies:

      • Volcano Operator
      • GPUaaS on OpenShift
      • Kubernetes Kueue
      • Vanilla Kubernetes DRA

       

      Without structured execution:

      • Evaluations become subjective

      • Discussions become opinion-based

      • Gaps remain hidden

      • Architecture decisions become risky

       

      This story ensures:

      • All technologies are evaluated using the exact same test scenarios

      • All MUST requirements are validated under contention

      • Fragmentation and edge cases are intentionally reproduced

      • Observability and logging behavior are inspected

      • Determinism and fairness are tested in practice

       

      The outcome will allow us to produce a factual comparison matrix:

      Requirement × Technology → Supported / Partially Supported / Not Supported

      This becomes the foundation for the GPUaaS architecture phase.

      No architecture decision should be made without completing this execution phase.
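The Requirement × Technology matrix described above could be represented roughly as follows. This is a minimal sketch, not an agreed format: the `Support` enum, `build_matrix`, and `render` are hypothetical names, and the "Not evaluated" placeholder for untested cells is an assumption that the story itself does not define.

```python
from enum import Enum


class Support(Enum):
    """The three classifications named in this story."""
    SUPPORTED = "Supported"
    PARTIAL = "Partially Supported"
    NOT_SUPPORTED = "Not Supported"


# Candidate technologies exactly as listed in this story.
TECHNOLOGIES = [
    "Volcano Operator",
    "GPUaaS on OpenShift",
    "Kubernetes Kueue",
    "Vanilla Kubernetes DRA",
]


def build_matrix(results, requirements, technologies=TECHNOLOGIES):
    """Build {requirement: {technology: Support or None}}.

    `results` maps (requirement, technology) -> Support. A cell with no
    recorded result stays None, meaning it has not been evaluated yet.
    """
    return {
        req: {tech: results.get((req, tech)) for tech in technologies}
        for req in requirements
    }


def render(matrix, technologies=TECHNOLOGIES):
    """Render the matrix as pipe-separated plain text, one requirement per row."""
    lines = ["Requirement | " + " | ".join(technologies)]
    for req, row in matrix.items():
        cells = [
            row[tech].value if row[tech] is not None else "Not evaluated"
            for tech in technologies
        ]
        lines.append(req + " | " + " | ".join(cells))
    return "\n".join(lines)
```

Calling `render(build_matrix({}, requirements))` with the requirement IDs from the PRD would produce an empty matrix in which every cell reads "Not evaluated". Cells are filled in only as owners record observed results.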

       


      Objective

      Execute the approved GPUaaS Validation Test Plan (Story 10980) across all candidate technologies and produce objective, evidence-based results.

      This story operationalizes the framework created in Story 10980.

       


      Technologies & Execution Owners

      Primary execution ownership:

      • Volcano Operator – Amit
      • GPUaaS on OpenShift – Wes
      • Kubernetes Kueue – Wes, Vikash, Jose
      • Vanilla Kubernetes DRA – Shared ownership (to be aligned)

       

      Primary owners are responsible for:

      • Deployment
      • Configuration
      • Scenario execution
      • First-level documentation

      All results require cross-review.


      Scope

      For each technology:

      • Deploy and configure the solution in a shared test environment
      • Implement quota and priority policies as defined in the test framework
      • Execute all validation scenarios defined in Story 10980
      • Capture logs, metrics, scheduling events, and system behavior
      • Compare expected vs observed behavior
      • Classify each requirement as Supported, Partially Supported, or Not Supported, attaching notes or logs where needed
      • Document operational complexity and hidden constraints

       


      Execution Principles

      • No assumptions. Only observed behavior.
      • All MUST requirements must be validated.
      • Contention must be intentionally created.
      • Fragmentation scenarios must be reproduced.
      • Preemption must be tested under real conflict.
      • Observability must be validated end-to-end.
      • If behavior cannot be demonstrated, it is considered not supported.
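The "only observed behavior" principle can be expressed as a small classification rule. The sketch below is one possible reading, not the approved method. `ScenarioOutcome` and `classify` are hypothetical names. The "Partially Supported" rule (some scenarios for a requirement demonstrated, but not all) is an assumption, since the exact criteria belong to the Story 10980 framework.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioOutcome:
    """Result of running one validation scenario for one requirement."""
    scenario: str
    executed: bool           # the scenario was actually run hands-on
    matched_expected: bool   # observed behavior matched expected behavior
    evidence: list = field(default_factory=list)  # references to logs, events, metrics


def is_demonstrated(outcome):
    """A behavior counts only if it was run, matched, and has evidence attached."""
    return outcome.executed and outcome.matched_expected and bool(outcome.evidence)


def classify(outcomes):
    """Classify one requirement from the outcomes of its scenarios.

    Nothing demonstrated (including nothing run) -> "Not Supported".
    Everything demonstrated -> "Supported".
    Otherwise -> "Partially Supported" (assumed rule, see lead-in).
    """
    demonstrated = [o for o in outcomes if is_demonstrated(o)]
    if not demonstrated:
        return "Not Supported"
    if len(demonstrated) == len(outcomes):
        return "Supported"
    return "Partially Supported"
```

Under this rule, a scenario that matched expectations but has no attached evidence does not count as demonstrated, so claims made without logs, events, or metrics cannot raise a classification.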

       


      Required Outputs Per Technology

      Each owner must deliver:

      Deployment Summary
      • Cluster setup details
      • Operators / CRDs used
      • Policy configuration approach

      Execution Results
      • Expected vs Observed behavior per requirement
      • Clear support classification

      Evidence
      • Logs
      • Events
      • Metrics
      • Screenshots

      Gap Analysis
      • Missing features
      • Operational friction
      • Non-deterministic behavior
      • Workarounds required

      Risk Assessment
      • Stability
      • Complexity
      • Production-readiness indicators
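Before cross-review, an owner could check a report against the required outputs listed above with something like the following. This is a sketch under assumptions: the nested-dictionary report shape and the `missing_items` helper are hypothetical. Treating every listed item as mandatory is also an assumption. An item with nothing to report would be filled with explicit text such as "None observed" rather than left blank.

```python
# Sections and items copied from the "Required Outputs Per Technology" list.
REQUIRED_SECTIONS = {
    "Deployment Summary": [
        "Cluster setup details",
        "Operators / CRDs used",
        "Policy configuration approach",
    ],
    "Execution Results": [
        "Expected vs Observed behavior per requirement",
        "Clear support classification",
    ],
    "Evidence": ["Logs", "Events", "Metrics", "Screenshots"],
    "Gap Analysis": [
        "Missing features",
        "Operational friction",
        "Non-deterministic behavior",
        "Workarounds required",
    ],
    "Risk Assessment": [
        "Stability",
        "Complexity",
        "Production-readiness indicators",
    ],
}


def missing_items(report):
    """Return "Section / Item" labels that are absent or blank in `report`.

    `report` is assumed to be {section name: {item name: free text}}.
    """
    missing = []
    for section, items in REQUIRED_SECTIONS.items():
        provided = report.get(section, {})
        for item in items:
            if not str(provided.get(item, "")).strip():
                missing.append(f"{section} / {item}")
    return missing
```

An empty result from `missing_items(report)` means the report covers every listed item and is ready for cross-review.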

       


      Consolidated Deliverable

      At completion, a single comparison document must exist containing:

      • Requirement × Technology support matrix
      • Behavioral notes
      • Preemption comparison
      • Fragmentation handling comparison
      • Observability maturity comparison
      • Operational complexity comparison
      • Explicit gap and risk summary

      This document becomes the formal baseline for the GPUaaS Architecture Definition phase.

       


      Definition of Done (DoD)

      • Story 10980 completed and approved prior to execution
      • All technologies tested hands-on
      • Every MUST requirement mapped to real support status
      • Evidence exists for critical behaviors
      • Single consolidated comparison matrix exists
      • Alignment session completed
      • Architecture phase can begin based on factual data

              wspinks@redhat.com Wesley Spinks
              rh-ee-abadli Aviran Badli