  AI Platform Core Components / AIPCC-10980

Define GPUaaS Validation Test Framework (Technology Evaluation Test Plan)

    • Workflow Validation Sprint 27

      Objective

      Define a structured, hands-on test plan that translates the GPUaaS Product Requirements into concrete, executable validation scenarios.

      This story does not evaluate technologies yet.

      Its purpose is to define exactly which tests must be executed in order to evaluate each candidate solution in a consistent and objective way.

      The output of this story will serve as the operational backbone for the GPUaaS technology evaluation epic.

       


       

      Background

      We have a clearly defined, vendor-agnostic GPUaaS Requirements Document.

      We are now entering the hands-on evaluation phase for:

      • Volcano Operator

      • GPUaaS on OpenShift

      • Kubernetes Kueue

      • Vanilla Kubernetes DRA

       

      To ensure fair and objective comparison, we must define a unified validation test framework before running experiments.

      Without a predefined test structure, evaluations risk becoming opinion-based or inconsistent.

      This document will transform the PRD into an executable validation matrix.

       


      Scope

      Create a single structured test document that:

      1. Maps each GPUaaS requirement (MUST / SHOULD / MAY) to one or more concrete validation tests.
      2. Defines reproducible workload scenarios.
      3. Specifies expected system behavior.
      4. Defines measurable success criteria.
      5. Defines required observability and logging validation.
      6. Defines failure and edge-case scenarios.

       

      This document must be technology-agnostic in design but technology-specific in execution.

      All participants evaluate all technologies using the same test plan.

       

      What the Test Framework Must Define

      For each relevant requirement category, define:

      Test Scenario Name

      Cluster Setup

      Quota & Priority Configuration

      Workload Type

      Conflict Trigger

      Expected Behavior

      Validation Method

      Metrics / Logs to Inspect

      Pass / Fail Criteria

      Any additional fields required by the specific category
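      The fields above can be captured as one structured record per scenario, so every row of the validation matrix has the same shape. The sketch below is illustrative only: the field names, requirement IDs, and values are assumptions for this test plan, not anything mandated by the PRD or by any candidate technology.

```python
from dataclasses import dataclass, field

@dataclass
class TestScenario:
    """One row of the validation matrix (field names are illustrative)."""
    name: str                      # Test Scenario Name
    requirement_ids: list          # PRD requirements covered (MUST/SHOULD/MAY)
    cluster_setup: str             # e.g. "2 nodes x 4 GPUs"
    quota_priority_config: dict    # Quota & Priority Configuration
    workload_type: str             # e.g. "batch", "gang", "inference"
    conflict_trigger: str          # action that provokes contention
    expected_behavior: str
    validation_method: str
    metrics_logs: list = field(default_factory=list)  # Metrics / Logs to Inspect
    pass_criteria: str = ""        # Pass / Fail Criteria

# Example instance (all values are placeholders):
scenario = TestScenario(
    name="preempt-low-priority-on-contention",
    requirement_ids=["MUST-01"],
    cluster_setup="2 nodes x 4 GPUs",
    quota_priority_config={"team-a": "high", "team-b": "low"},
    workload_type="batch",
    conflict_trigger="submit high-priority job while all GPUs are allocated",
    expected_behavior="lowest-priority workload is preempted gracefully",
    validation_method="inspect pod events and scheduler logs",
    metrics_logs=["preemption events", "queue depth"],
    pass_criteria="high-priority job running within 120s; exactly one preemption",
)
print(scenario.name)
```

      Keeping scenarios as uniform records makes it trivial to check later that every requirement is covered and that all four technologies were tested against identical inputs.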

       


       

      Mandatory Test Categories

      The document must include (at minimum):

       

      Scheduling & Priority

      • Priority-aware scheduling validation

      • Deterministic preemption behavior

      • Graceful preemption validation

      • Fairness within same priority tier

      • Opportunistic scheduling behavior

      • Starvation protection within opportunistic tier

      • Gang scheduling correctness

      • Queue-based scheduling behavior
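      To make "deterministic preemption behavior" testable, the plan needs a reference oracle: given identical inputs, the chosen preemption victim must be identical across runs. The toy model below (lowest priority tier first, youngest pod as tie-breaker) is NOT the algorithm of Volcano, Kueue, or DRA — it is only a sketch of how the test plan can express an expected, repeatable outcome.

```python
def pick_victim(pods):
    """Toy victim selection: lowest priority, then most recently started."""
    # Sort key: priority ascending, start_time descending (via negation).
    return min(pods, key=lambda p: (p["priority"], -p["start_time"]))

pods = [
    {"name": "train-a", "priority": 100, "start_time": 10},
    {"name": "batch-b", "priority": 10,  "start_time": 50},
    {"name": "batch-c", "priority": 10,  "start_time": 70},
]

first = pick_victim(pods)
second = pick_victim(list(reversed(pods)))  # identical inputs, shuffled order
print(first["name"])   # batch-c: lowest priority tier, started last
assert first == second  # determinism: same inputs -> same victim
```

      In the actual tests, the equivalent check is repeating the same contention scenario several times and asserting the candidate scheduler evicts the same workload every time.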

       

      Quota & Policy Enforcement

      • Namespace-level quota enforcement

      • Quota + priority bounded scheduling

      • RBAC enforcement on priority submission

      • Quota exhaustion handling

      • Time-bound reservation behavior (if supported)
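      The quota-exhaustion tests reduce to one invariant: a request beyond the namespace's remaining GPU quota must be rejected or queued, never silently over-allocated. A minimal model of that invariant, with illustrative limits, might look like:

```python
def admit(namespace_usage, namespace_quota, requested_gpus):
    """Return True if the request fits within the namespace's remaining quota."""
    return namespace_usage + requested_gpus <= namespace_quota

# Within quota: 3 GPUs used of 4, requesting 1 more -> admitted.
assert admit(namespace_usage=3, namespace_quota=4, requested_gpus=1) is True
# Exhausted: 3 GPUs used of 4, requesting 2 more -> must be rejected/queued.
assert admit(namespace_usage=3, namespace_quota=4, requested_gpus=2) is False
print("quota checks passed")
```

      Each candidate technology expresses this differently (ResourceQuota, Kueue ClusterQueue limits, Volcano queue capability), so the test plan should state the invariant in this technology-agnostic form and map it to each system's configuration during execution.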

       

      Observability & Transparency

      • Preemption event logging

      • Structured preemption logs

      • Insufficient resources logging

      • Human-readable failure reason exposure

      • Resource contention metrics

      • Per-team usage visibility

      • Historical usage tracking

      • Dashboard consistency

      • Queue size and queue visibility
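      For "structured preemption logs", the test plan can define a concrete validity check: each emitted event must be machine-parseable and carry a minimum set of fields. The key names below are assumptions made for this plan, not a schema any candidate guarantees; during execution they would be mapped to each technology's actual event format.

```python
import json

# Minimum fields a structured preemption event is expected to carry
# (assumed names for this test plan only).
REQUIRED_KEYS = {"timestamp", "preempted_pod", "preemptor_pod", "reason"}

def is_valid_preemption_event(line):
    """True if the log line is JSON and carries all required keys."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(event, dict) and REQUIRED_KEYS.issubset(event)

good = ('{"timestamp": "2025-01-01T00:00:00Z", "preempted_pod": "batch-b", '
        '"preemptor_pod": "train-a", "reason": "priority"}')
bad = "preempted pod batch-b"  # free text fails the structured check
print(is_valid_preemption_event(good), is_valid_preemption_event(bad))
```

      The same pattern extends to insufficient-resources logging: define the required fields once, then grade each technology's output against them.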

       

      Reliability & Determinism

      • Controller HA behavior (if applicable)

      • Scheduler deterministic behavior under identical inputs

      • Failure recovery after node loss

      • Behavior under high contention

       

      Fragmentation & Topology

      • Node-level fragmentation scenario (multi-node allocation deadlock)

      • Multi-GPU contiguous allocation validation

      • Topology-aware scheduling validation (NVLink / NUMA if supported)
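      The node-level fragmentation scenario can be pinned down precisely: enough GPUs are free cluster-wide, yet no single node can host a multi-GPU pod that cannot be split across nodes. A minimal reproduction of that state, with illustrative node sizes:

```python
def fits_on_single_node(free_per_node, requested_gpus):
    """A multi-GPU pod needs all its GPUs on one node (no cross-node split)."""
    return any(free >= requested_gpus for free in free_per_node)

free_per_node = [4, 4]   # 8 GPUs free in total, fragmented 4 + 4
assert sum(free_per_node) == 8
assert fits_on_single_node(free_per_node, 4) is True
assert fits_on_single_node(free_per_node, 5) is False  # fragmentation deadlock
print("fragmentation scenario reproduced")
```

      The test then asserts that each candidate either packs workloads to avoid reaching this state, or at minimum reports the unschedulable request with a clear, human-readable reason rather than leaving it pending silently.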

       

      Operational Constraints

      • Behavior with existing clusters (opt-in mode)

      • Recognition of GPU workloads not submitted via the scheduler

      • Barrier-to-entry assessment for cluster owners

       


       

      Open Design Experiments To Include

      The test plan must also define experiments for:

      • Opportunistic fair-share behavior

      • Idle GPU reclaim policy validation

      • Manual preemption override

      • Cross-namespace interference scenarios

      Additional experiments may be added as open design questions emerge.

       


       

      Deliverable

       

      A single consolidated “GPUaaS Validation Test Plan” document containing:

      1. Structured test definitions per requirement
      2. Reproducible workload descriptions
      3. Conflict-based scenarios
      4. Measurable validation criteria
      5. Explicit pass/fail logic

      This document must be sufficiently detailed so that any engineer can execute the tests consistently across technologies.
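      Because the DoD requires every MUST requirement to have at least one validation scenario, the deliverable lends itself to an automated coverage check. The sketch below uses placeholder requirement and scenario IDs, not real PRD identifiers:

```python
# Placeholder MUST requirement IDs (the real IDs come from the PRD).
must_requirements = ["MUST-01", "MUST-02", "MUST-03"]

# Map of scenario name -> requirements it validates (illustrative).
scenario_coverage = {
    "preempt-low-priority-on-contention": ["MUST-01"],
    "namespace-quota-exhaustion": ["MUST-02", "MUST-03"],
}

covered = {req for reqs in scenario_coverage.values() for req in reqs}
uncovered = [r for r in must_requirements if r not in covered]
assert not uncovered, f"MUST requirements without a test: {uncovered}"
print("all MUST requirements covered")
```

      Running this check against the finished document gives a mechanical answer to the first DoD bullet instead of relying on a manual review of the matrix.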

       


      Out of Scope

      • Running the experiments

      • Selecting the "winning" technology

      • Defining GPUaaS architecture

      This story defines the test framework only.

       


       

      DoD

      This story is complete when:

      • Every MUST requirement (from the GPUaaS Requirements Document: https://docs.google.com/document/d/1eGLwzmlK115DGoaxTlHhR1jwLjQ1f6oxbegWI2c6Bfg/edit?tab=t.0) has at least one explicit, executable validation scenario

      • All test scenarios include measurable success criteria

      • The document enables consistent comparison across Volcano, Kueue, DRA, and OpenShift-based GPUaaS

      • The team aligns on the final version and confirms readiness to begin hands-on evaluation

              wspinks@redhat.com Wesley Spinks
              rh-ee-abadli Aviran Badli