  AI Platform Core Components / AIPCC-10980

Define GPUaaS Validation Test Framework (Technology Evaluation Test Plan)

    • Workflow Validation Sprint 27

      Objective

      Define a structured, hands-on test plan that translates the GPUaaS Product Requirements into concrete, executable validation scenarios.

      This story does not evaluate technologies yet.

      Its purpose is to define exactly which tests must be executed in order to evaluate each candidate solution in a consistent and objective way.

      The output of this story will serve as the operational backbone for the GPUaaS technology evaluation epic.

       


       

      Background

      We have a clearly defined, vendor-agnostic GPUaaS Requirements Document.

      We are now entering the hands-on evaluation phase for:

      • Volcano Operator

      • GPUaaS on OpenShift

      • Kubernetes Kueue

      • Vanilla Kubernetes DRA

       

      To ensure fair and objective comparison, we must define a unified validation test framework before running experiments.

      Without a predefined test structure, evaluations risk becoming opinion-based or inconsistent.

      This document will transform the PRD into an executable validation matrix.

       


      Scope

      Create a single structured test document that:

      1. Maps each GPUaaS requirement (MUST / SHOULD / MAY) to one or more concrete validation tests.
      2. Defines reproducible workload scenarios.
      3. Specifies expected system behavior.
      4. Defines measurable success criteria.
      5. Defines required observability and logging validation.
      6. Defines failure and edge-case scenarios.

       

      This document must be technology-agnostic in design but technology-specific in execution.

      All participants evaluate all technologies using the same test plan.

       

      What the Test Framework Must Define

      For each relevant requirement category, define:

      Test Scenario Name

      Cluster Setup

      Quota & Priority Configuration

      Workload Type

      Conflict Trigger

      Expected Behavior

      Validation Method

      Metrics / Logs to Inspect

      Pass / Fail Criteria

      Any additional fields required by the specific category
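      The fields above can be captured as one structured record per scenario, so every row of the validation matrix has the same shape. The sketch below is illustrative only: the field names, requirement IDs, and values are assumptions for this test plan, not anything mandated by the PRD or by any candidate technology.

```python
from dataclasses import dataclass, field

@dataclass
class TestScenario:
    """One row of the validation matrix (field names are illustrative)."""
    name: str                      # Test Scenario Name
    requirement_ids: list          # PRD requirements covered (MUST/SHOULD/MAY)
    cluster_setup: str             # e.g. "2 nodes x 4 GPUs"
    quota_priority_config: dict    # Quota & Priority Configuration
    workload_type: str             # e.g. "batch", "gang", "inference"
    conflict_trigger: str          # action that provokes contention
    expected_behavior: str
    validation_method: str
    metrics_logs: list = field(default_factory=list)  # Metrics / Logs to Inspect
    pass_criteria: str = ""        # Pass / Fail Criteria

# Example instance (all values are placeholders):
scenario = TestScenario(
    name="preempt-low-priority-on-contention",
    requirement_ids=["MUST-01"],
    cluster_setup="2 nodes x 4 GPUs",
    quota_priority_config={"team-a": "high", "team-b": "low"},
    workload_type="batch",
    conflict_trigger="submit high-priority job while all GPUs are allocated",
    expected_behavior="lowest-priority workload is preempted gracefully",
    validation_method="inspect pod events and scheduler logs",
    metrics_logs=["preemption events", "queue depth"],
    pass_criteria="high-priority job running within 120s; exactly one preemption",
)
print(scenario.name)
```

      Keeping scenarios as uniform records makes it trivial to check later that every requirement is covered and that all four technologies were tested against identical inputs.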

       


       

      Mandatory Test Categories

      The document must include (at minimum):

       

      Scheduling & Priority

      • Priority-aware scheduling validation

      • Deterministic preemption behavior

      • Graceful preemption validation

      • Fairness within same priority tier

      • Opportunistic scheduling behavior

      • Starvation protection within opportunistic tier

      • Gang scheduling correctness

      • Queue-based scheduling behavior
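      To make "deterministic preemption behavior" testable, the plan needs a reference oracle: given identical inputs, the chosen preemption victim must be identical across runs. The toy model below (lowest priority tier first, youngest pod as tie-breaker) is NOT the algorithm of Volcano, Kueue, or DRA — it is only a sketch of how the test plan can express an expected, repeatable outcome.

```python
def pick_victim(pods):
    """Toy victim selection: lowest priority, then most recently started."""
    # Sort key: priority ascending, start_time descending (via negation).
    return min(pods, key=lambda p: (p["priority"], -p["start_time"]))

pods = [
    {"name": "train-a", "priority": 100, "start_time": 10},
    {"name": "batch-b", "priority": 10,  "start_time": 50},
    {"name": "batch-c", "priority": 10,  "start_time": 70},
]

first = pick_victim(pods)
second = pick_victim(list(reversed(pods)))  # identical inputs, shuffled order
print(first["name"])   # batch-c: lowest priority tier, started last
assert first == second  # determinism: same inputs -> same victim
```

      In the actual tests, the equivalent check is repeating the same contention scenario several times and asserting the candidate scheduler evicts the same workload every time.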

       

      Quota & Policy Enforcement

      • Namespace-level quota enforcement

      • Quota + priority bounded scheduling

      • RBAC enforcement on priority submission

      • Quota exhaustion handling

      • Time-bound reservation behavior (if supported)
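      The quota-exhaustion tests reduce to one invariant: a request beyond the namespace's remaining GPU quota must be rejected or queued, never silently over-allocated. A minimal model of that invariant, with illustrative limits, might look like:

```python
def admit(namespace_usage, namespace_quota, requested_gpus):
    """Return True if the request fits within the namespace's remaining quota."""
    return namespace_usage + requested_gpus <= namespace_quota

# Within quota: 3 GPUs used of 4, requesting 1 more -> admitted.
assert admit(namespace_usage=3, namespace_quota=4, requested_gpus=1) is True
# Exhausted: 3 GPUs used of 4, requesting 2 more -> must be rejected/queued.
assert admit(namespace_usage=3, namespace_quota=4, requested_gpus=2) is False
print("quota checks passed")
```

      Each candidate technology expresses this differently (ResourceQuota, Kueue ClusterQueue limits, Volcano queue capability), so the test plan should state the invariant in this technology-agnostic form and map it to each system's configuration during execution.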

       

      Observability & Transparency

      • Preemption event logging

      • Structured preemption logs

      • Insufficient resources logging

      • Human-readable failure reason exposure

      • Resource contention metrics

      • Per-team usage visibility

      • Historical usage tracking

      • Dashboard consistency

      • Queue size and queue visibility
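      For "structured preemption logs", the test plan can define a concrete validity check: each emitted event must be machine-parseable and carry a minimum set of fields. The key names below are assumptions made for this plan, not a schema any candidate guarantees; during execution they would be mapped to each technology's actual event format.

```python
import json

# Minimum fields a structured preemption event is expected to carry
# (assumed names for this test plan only).
REQUIRED_KEYS = {"timestamp", "preempted_pod", "preemptor_pod", "reason"}

def is_valid_preemption_event(line):
    """True if the log line is JSON and carries all required keys."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(event, dict) and REQUIRED_KEYS.issubset(event)

good = ('{"timestamp": "2025-01-01T00:00:00Z", "preempted_pod": "batch-b", '
        '"preemptor_pod": "train-a", "reason": "priority"}')
bad = "preempted pod batch-b"  # free text fails the structured check
print(is_valid_preemption_event(good), is_valid_preemption_event(bad))
```

      The same pattern extends to insufficient-resources logging: define the required fields once, then grade each technology's output against them.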

       

      Reliability & Determinism

      • Controller HA behavior (if applicable)

      • Scheduler deterministic behavior under identical inputs

      • Failure recovery after node loss

      • Behavior under high contention

       

      Fragmentation & Topology

      • Node-level fragmentation scenario (multi-node allocation deadlock)

      • Multi-GPU contiguous allocation validation

      • Topology-aware scheduling validation (NVLink / NUMA if supported)
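      The node-level fragmentation scenario can be pinned down precisely: enough GPUs are free cluster-wide, yet no single node can host a multi-GPU pod that cannot be split across nodes. A minimal reproduction of that state, with illustrative node sizes:

```python
def fits_on_single_node(free_per_node, requested_gpus):
    """A multi-GPU pod needs all its GPUs on one node (no cross-node split)."""
    return any(free >= requested_gpus for free in free_per_node)

free_per_node = [4, 4]   # 8 GPUs free in total, fragmented 4 + 4
assert sum(free_per_node) == 8
assert fits_on_single_node(free_per_node, 4) is True
assert fits_on_single_node(free_per_node, 5) is False  # fragmentation deadlock
print("fragmentation scenario reproduced")
```

      The test then asserts that each candidate either packs workloads to avoid reaching this state, or at minimum reports the unschedulable request with a clear, human-readable reason rather than leaving it pending silently.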

       

      Operational Constraints

      • Behavior with existing clusters (opt-in mode)

      • Recognition of GPU workloads not submitted via the scheduler

      • Barrier-to-entry assessment for cluster owners

       


       

      Open Design Experiments To Include

      The test plan must also define experiments for:

      • Opportunistic fair-share behavior

      • Idle GPU reclaim policy validation

      • Manual preemption override

      • Cross-namespace interference scenarios

      Additional experiments may be added as open design questions emerge.

       


       

      Deliverable

       

      A single consolidated “GPUaaS Validation Test Plan” document containing:

      1. Structured test definitions per requirement
      2. Reproducible workload descriptions
      3. Conflict-based scenarios
      4. Measurable validation criteria
      5. Explicit pass/fail logic

      This document must be sufficiently detailed so that any engineer can execute the tests consistently across technologies.
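      Because the DoD requires every MUST requirement to have at least one validation scenario, the deliverable lends itself to an automated coverage check. The sketch below uses placeholder requirement and scenario IDs, not real PRD identifiers:

```python
# Placeholder MUST requirement IDs (the real IDs come from the PRD).
must_requirements = ["MUST-01", "MUST-02", "MUST-03"]

# Map of scenario name -> requirements it validates (illustrative).
scenario_coverage = {
    "preempt-low-priority-on-contention": ["MUST-01"],
    "namespace-quota-exhaustion": ["MUST-02", "MUST-03"],
}

covered = {req for reqs in scenario_coverage.values() for req in reqs}
uncovered = [r for r in must_requirements if r not in covered]
assert not uncovered, f"MUST requirements without a test: {uncovered}"
print("all MUST requirements covered")
```

      Running this check against the finished document gives a mechanical answer to the first DoD bullet instead of relying on a manual review of the matrix.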

       


      Out of Scope

      • Running the experiments

      • Selecting the "winning" technology

      • Defining GPUaaS architecture

      This story defines the test framework only.

       


       

      DoD

      This story is complete when:

      • Every MUST requirement (from the GPUaaS Requirements Document: https://docs.google.com/document/d/1eGLwzmlK115DGoaxTlHhR1jwLjQ1f6oxbegWI2c6Bfg/edit?tab=t.0) has at least one explicit, executable validation scenario

      • All test scenarios include measurable success criteria

      • The document enables consistent comparison across Volcano, Kueue, DRA, and OpenShift-based GPUaaS

      • The team aligns on the final version and confirms readiness to begin hands-on evaluation

              wspinks@redhat.com Wesley Spinks
              rh-ee-abadli Aviran Badli