Story
Resolution: Unresolved
Blocked By: Story 10980 – GPUaaS Validation Test Framework Definition
This story may begin only after Story 10980 is fully completed and approved.
Context
Over the past weeks we have established a structured foundation for the GPUaaS initiative.
We already have:
A technology-agnostic requirements definition describing what GPUaaS must support across scheduling, quotas, observability, fairness, reliability, fragmentation handling, and multi-cluster behavior.
This document defines the what and the why, not the implementation.
- GPUaaS Vision & Historical Context
A clear understanding of the operational problem we are solving, based on real experience managing GPUs at scale.
We know what good looks like: priority scheduling, opportunistic usage, reclaim policies, visibility, and data-driven control.
- Cross-Team Orchestration Strategy (AIPCC)
A defined roadmap toward shared GPU orchestration across teams, including controller logic, queues, observability dashboards, and multi-cluster intelligence.
- Story 10980 – GPUaaS Validation Test Framework
A formal, structured test plan that translates the PRD into executable validation scenarios.
This ensures we evaluate technologies based on controlled experiments rather than opinion.
Why This Story Is Critical
At this stage, we are moving from theory to evidence.
We are evaluating multiple technologies:
- Volcano Operator
- GPUaaS on OpenShift
- Kubernetes Kueue
- Vanilla Kubernetes DRA
Without structured execution:
- Evaluations become subjective
- Discussions become opinion-based
- Gaps remain hidden
- Architecture decisions become risky
This story ensures:
- All technologies are evaluated using the exact same test scenarios
- All MUST requirements are validated under contention
- Fragmentation and edge cases are intentionally reproduced
- Observability and logging behavior are inspected
- Determinism and fairness are tested in practice
The outcome will allow us to produce a factual comparison matrix:
Requirement × Technology → Supported / Partially Supported / Not Supported
This becomes the foundation for the GPUaaS architecture phase.
No architecture decision should be made without completing this execution phase.
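The comparison matrix described above can be sketched as a small data structure. This is only an illustration, not part of the approved framework: the requirement IDs (`REQ-01`, `REQ-02`) and the recorded statuses are hypothetical placeholders, not findings. Only the technology names and the three support levels come from this story.

```python
from enum import Enum


class Support(Enum):
    SUPPORTED = "Supported"
    PARTIAL = "Partially Supported"
    NOT_SUPPORTED = "Not Supported"


# Candidate technologies as listed in this story.
TECHNOLOGIES = [
    "Volcano Operator",
    "GPUaaS on OpenShift",
    "Kubernetes Kueue",
    "Vanilla Kubernetes DRA",
]


def build_matrix(requirements, results):
    """Return {requirement: {technology: Support}}.

    Any (requirement, technology) pair without a recorded result
    defaults to NOT_SUPPORTED, so untested cells never look like passes.
    """
    return {
        req: {
            tech: results.get((req, tech), Support.NOT_SUPPORTED)
            for tech in TECHNOLOGIES
        }
        for req in requirements
    }


def render(matrix):
    """Render one text line per requirement, one cell per technology."""
    lines = []
    for req, row in matrix.items():
        cells = " | ".join(f"{tech}: {status.value}" for tech, status in row.items())
        lines.append(f"{req} | {cells}")
    return "\n".join(lines)


# Hypothetical placeholder data, NOT real evaluation results.
requirements = ["REQ-01", "REQ-02"]
results = {
    ("REQ-01", "Kubernetes Kueue"): Support.SUPPORTED,
    ("REQ-02", "Volcano Operator"): Support.PARTIAL,
}

matrix = build_matrix(requirements, results)
print(render(matrix))
```

Defaulting empty cells to Not Supported is a design choice made for this sketch; it keeps the matrix conservative until evidence is recorded.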
Objective
Execute the approved GPUaaS Validation Test Plan (10980) across all candidate technologies and produce objective, evidence-based results.
This story operationalizes the framework created in Story 10980.
Technologies & Execution Owners
Primary execution ownership:
Volcano Operator – Amit
GPUaaS on OpenShift – Wes
Kubernetes Kueue – Wes, Vikash, Jose
Vanilla Kubernetes DRA – Shared ownership (to be aligned)
Primary owners are responsible for:
Deployment
Configuration
Scenario execution
First-level documentation
All results require cross-review.
Scope
For each technology:
- Deploy and configure the solution in a shared test environment
- Implement quota and priority policies as defined in the test framework
- Execute all validation scenarios defined in Story 10980
- Capture logs, metrics, scheduling events, and system behavior
- Compare expected vs observed behavior
- Classify each requirement as one of the following, attaching notes and logs where needed:
  - Supported
  - Partially Supported
  - Not Supported
- Document operational complexity and hidden constraints
Execution Principles
No assumptions. Only observed behavior.
All MUST requirements must be validated.
Contention must be intentionally created.
Fragmentation scenarios must be reproduced.
Preemption must be tested under real conflict.
Observability must be validated end-to-end.
If behavior cannot be demonstrated, it is considered not supported.
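The principles above, in particular "if behavior cannot be demonstrated, it is considered not supported", can be expressed as a simple classification rule. The sketch below is an assumption about how an owner might encode it: the `ScenarioResult` fields, the `classify` function, and the rule that documented caveats downgrade a result to Partially Supported are all hypothetical. The story itself does not define when a requirement counts as partially supported.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScenarioResult:
    requirement: str                 # hypothetical requirement ID
    expected: str                    # expected behavior label from the test plan
    observed: Optional[str] = None   # None means the behavior was never demonstrated
    evidence: List[str] = field(default_factory=list)  # log, event, or metric references
    caveats: List[str] = field(default_factory=list)   # workarounds or limitations noted


def classify(result: ScenarioResult) -> str:
    """Map one scenario result to a support level.

    Assumed rules (not defined by the story):
    - no observation or no evidence   -> Not Supported
    - observed differs from expected  -> Not Supported
    - matches, but caveats were noted -> Partially Supported
    - matches with evidence           -> Supported
    """
    if result.observed is None or not result.evidence:
        return "Not Supported"
    if result.observed != result.expected:
        return "Not Supported"
    if result.caveats:
        return "Partially Supported"
    return "Supported"


# Hypothetical examples only.
undemonstrated = ScenarioResult("REQ-01", expected="preempted")
demonstrated = ScenarioResult(
    "REQ-01", expected="preempted", observed="preempted", evidence=["events.txt"]
)
with_workaround = ScenarioResult(
    "REQ-01",
    expected="preempted",
    observed="preempted",
    evidence=["events.txt"],
    caveats=["required manual queue configuration"],
)

for r in (undemonstrated, demonstrated, with_workaround):
    print(r.requirement, "->", classify(r))
```

Comparing `expected` and `observed` as plain labels is a simplification; real scenarios would likely need a richer comparison agreed in the test framework.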
Required Outputs Per Technology
Each owner must deliver:
- Deployment Summary
  - Cluster setup details
  - Operators / CRDs used
  - Policy configuration approach
- Execution Results
  - Expected vs observed behavior per requirement
  - Clear support classification
- Evidence
  - Logs
  - Events
  - Metrics
  - Screenshots
- Gap Analysis
  - Missing features
  - Operational friction
  - Non-deterministic behavior
  - Workarounds required
- Risk Assessment
  - Stability
  - Complexity
  - Production-readiness indicators
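A per-technology report can be checked for completeness against the required outputs listed above before cross-review. The following is a minimal sketch under stated assumptions: the report is assumed to be a nested dictionary, and `missing_outputs` and `draft_report` are hypothetical names introduced for illustration. Section and item names are copied from this story.

```python
# Required sections and items, as listed under "Required Outputs Per Technology".
REQUIRED_OUTPUTS = {
    "Deployment Summary": [
        "Cluster setup details",
        "Operators / CRDs used",
        "Policy configuration approach",
    ],
    "Execution Results": [
        "Expected vs Observed behavior per requirement",
        "Clear support classification",
    ],
    "Evidence": ["Logs", "Events", "Metrics", "Screenshots"],
    "Gap Analysis": [
        "Missing features",
        "Operational friction",
        "Non-deterministic behavior",
        "Workarounds required",
    ],
    "Risk Assessment": [
        "Stability",
        "Complexity",
        "Production-readiness indicators",
    ],
}


def missing_outputs(report):
    """Return 'Section / Item' for every required item that is absent or empty."""
    missing = []
    for section, items in REQUIRED_OUTPUTS.items():
        provided = report.get(section, {})
        for item in items:
            if not provided.get(item):
                missing.append(f"{section} / {item}")
    return missing


# Hypothetical, deliberately incomplete draft report.
draft_report = {
    "Deployment Summary": {
        "Cluster setup details": "placeholder text",
        "Operators / CRDs used": "placeholder text",
    },
    "Evidence": {"Logs": ["placeholder.log"]},
}

for entry in missing_outputs(draft_report):
    print("MISSING:", entry)
```

Because empty values count as missing in this sketch, an owner with nothing to report for an item (for example, no workarounds required) would state that explicitly rather than leave the field blank.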
Consolidated Deliverable
At completion, a single comparison document must exist containing:
Requirement × Technology support matrix
Behavioral notes
Preemption comparison
Fragmentation handling comparison
Observability maturity comparison
Operational complexity comparison
Explicit gap and risk summary
This document becomes the formal baseline for the GPUaaS Architecture Definition phase.
Definition of Done (DoD)
Story 10980 completed and approved prior to execution
All technologies tested hands-on
Every MUST requirement mapped to real support status
Evidence exists for critical behaviors
Single consolidated comparison matrix exists
Alignment session completed
Architecture phase can begin based on factual data