GPUaaS Technology Research & Hands-On Evaluation

Type: Epic
Status: In Progress
Priority: Critical
Resolution: Unresolved
Epic link: AIPCC-9826 – GPU as a Non-Issue – Inventory, Research, and Incremental GPUaaS Implementation (AIPCC Only)
Progress: 50% To Do, 19% In Progress, 31% Done
This document defines the functional and operational requirements for the GPUaaS initiative.
Its purpose is to help guide what to research, what to read, and what to validate when evaluating different technologies.
The requirements serve as a clear set of principles and expectations that any candidate solution should be measured against.
https://docs.google.com/document/d/1eGLwzmlK115DGoaxTlHhR1jwLjQ1f6oxbegWI2c6Bfg/edit?tab=t.0
Context
Following the GPUaaS requirements definition, we need to validate how existing technologies actually behave in practice.
This epic bridges the gap between theoretical requirements and real-world capabilities by combining hands-on experimentation with structured analysis.
The outcome of this epic is a single, consolidated document that maps requirements to real system behavior.
Objective
Produce one consolidated evaluation document that combines:
- The GPUaaS requirements defined in the previous epic
- Hands-on experimentation results
- A clear matrix showing which technology supports which requirement, and how
This document will serve as the factual baseline for the GPUaaS architecture definition in the next phase.
Technologies to Evaluate
- Run:AI
- Kubernetes Kueue
- Kubernetes DRA
- GPU-related capabilities in OpenShift AI (to start, watch https://drive.google.com/file/d/1wW9n3MkMLp8HvXGqPAYHaqiyp-V_8A5b/view and ping the relevant people)
- Additional relevant technologies identified during research
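As an illustration of what hands-on DRA experiments might exercise, here is a minimal sketch of requesting a GPU through a ResourceClaimTemplate. This is a sketch only: the Kubernetes DRA API is still in beta and changes between releases, and the device class name `gpu.nvidia.com` assumes an NVIDIA DRA driver is installed on the test cluster.

```yaml
# Sketch only: requires a cluster with DynamicResourceAllocation enabled
# and a DRA driver that publishes the referenced device class.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com   # assumed driver-provided class
```

Pods then reference the template via `spec.resourceClaims`, and the scheduler allocates a matching device when the pod is admitted.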
Additional Scope – Research
- As part of this research task, also investigate the internal GPU resource management solution used in RDU4 (Red Hat’s internal data center in Raleigh/Durham), including how GPUs are managed, allocated, and monitored today.
- Contacts: Brian Cinque, Maria Bellman
- Be aware >> Rich Hardy said we already have an IT service based on the GPUaaS toolset in OpenShift AI called MOSAIC.
https://source.redhat.com/departments/it/ai_platforms/mosaic_platform
- In addition, coordinate with the IBM Cloud team to understand how GPU resources are managed on IBM Cloud.
- Contact: Kieran Forde
Important Principle
All participants evaluate all technologies.
The goal is collective understanding, not divided ownership or isolated expertise.
Scope
In Scope
- Deploying each candidate solution
- Hands-on experimentation using real workloads
- Evaluating practical behavior such as:
- Quotas
- Priorities
- Preemption
- Operational complexity
- Comparing observed behavior directly against the GPUaaS requirements defined in the previous steps
- Building a structured support matrix (requirement × technology)
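As one example of what the quota, priority, and preemption experiments could exercise, Kueue models team-level GPU quotas in a ClusterQueue. The sketch below assumes Kueue is installed and a ResourceFlavor named `default-flavor` exists; the queue name and quota value are illustrative, not a proposed configuration.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a            # illustrative queue name
spec:
  namespaceSelector: {}   # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8   # quota the experiments would try to exhaust
  preemption:
    withinClusterQueue: LowerPriority   # higher-priority workloads may preempt
    reclaimWithinCohort: Any
```

Submitting workloads beyond the nominal quota, then raising their priority, is a direct way to observe queueing and preemption behavior against the defined requirements.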
Out of Scope
- Defining GPUaaS architecture
- Selecting a final implementation
- Production rollout
Execution
- Use a shared test environment
- Run experiments locally or on the Model Validation GCP environment
- Deploy, configure, and actively use each solution
- Document findings based on observed behavior, not assumptions
Deliverables
- A single consolidated evaluation document containing:
- Hands-on findings per technology
- A requirements support matrix (supported / partially supported / not supported)
- Operational insights and limitations
- Clear identification of gaps and risks per solution
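A sketch of how the support matrix could be assembled mechanically from recorded findings. All requirement names, technologies, and statuses below are placeholders for illustration, not actual evaluation results.

```python
# Hypothetical helper for rendering the requirement x technology matrix
# as a Markdown table. Findings are recorded as short status codes.

STATUSES = {"S": "supported", "P": "partially supported", "N": "not supported"}

def build_matrix(findings: dict, technologies: list) -> str:
    """Render findings as a Markdown table with one row per requirement."""
    lines = [
        "| Requirement | " + " | ".join(technologies) + " |",
        "|" + "---|" * (len(technologies) + 1),
    ]
    for requirement, support in findings.items():
        cells = [STATUSES.get(support.get(tech, "N"), "not supported")
                 for tech in technologies]
        lines.append("| " + requirement + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Placeholder data, not real results:
techs = ["Run:AI", "Kueue", "DRA"]
findings = {
    "GPU quotas per team": {"Run:AI": "S", "Kueue": "S", "DRA": "N"},
    "Workload preemption": {"Run:AI": "S", "Kueue": "P", "DRA": "N"},
}
print(build_matrix(findings, techs))
```

Keeping the findings in a simple structure like this makes it easy to regenerate the matrix as experiments update the recorded statuses.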
DoD
This epic is complete when:
- All candidate technologies have been evaluated hands-on
- Each GPUaaS requirement is mapped to its real support status per technology
- A single comparison matrix document exists
Notes
This epic does not define architecture, but its output is explicitly intended to be the input and foundation for the architecture epic that follows.