GPUaaS Technology Research & Hands-On Evaluation

Type: Epic
Status: In Progress
Priority: Critical
Resolution: Unresolved
Epic link: AIPCC-9826 – GPU as a Non-Issue – Inventory, Research, and Incremental GPUaaS Implementation (AIPCC Only)
Progress: 50% To Do, 19% In Progress, 31% Done
This document defines the functional and operational requirements for the GPUaaS initiative.
Its purpose is to help guide what to research, what to read, and what to validate when evaluating different technologies.
The requirements serve as a clear set of principles and expectations that any candidate solution should be measured against.
https://docs.google.com/document/d/1eGLwzmlK115DGoaxTlHhR1jwLjQ1f6oxbegWI2c6Bfg/edit?tab=t.0
Context
Following the GPUaaS requirements definition, we need to validate how existing technologies actually behave in practice.
This epic bridges the gap between theoretical requirements and real-world capabilities by combining hands-on experimentation with structured analysis.
The outcome of this epic is a single, consolidated document that maps requirements to real system behavior.
Objective
Produce one consolidated evaluation document that combines:
- The GPUaaS requirements defined in the previous epic
- Hands-on experimentation results
- A clear matrix showing which technology supports which requirement, and how
This document will serve as the factual baseline for the GPUaaS architecture definition in the next phase.
Technologies to Evaluate
- Run:AI
- Kubernetes Kueue
- Kubernetes DRA
- GPU-related capabilities in OpenShift AI (to start, watch https://drive.google.com/file/d/1wW9n3MkMLp8HvXGqPAYHaqiyp-V_8A5b/view and ping the relevant people)
- Additional relevant technologies identified during research
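As an illustration of what hands-on DRA experiments might exercise, here is a minimal sketch of requesting a GPU through a ResourceClaimTemplate. This is a sketch only: the Kubernetes DRA API is still in beta and changes between releases, and the device class name `gpu.nvidia.com` assumes an NVIDIA DRA driver is installed on the test cluster.

```yaml
# Sketch only: requires a cluster with DynamicResourceAllocation enabled
# and a DRA driver that publishes the referenced device class.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com   # assumed driver-provided class
```

Pods then reference the template via `spec.resourceClaims`, and the scheduler allocates a matching device when the pod is admitted.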
Additional Scope – Research
- As part of this research task, also investigate the internal GPU resource management solution used in RDU4 (Red Hat’s internal data center in Raleigh/Durham), including how GPUs are managed, allocated, and monitored today.
- Contacts: Brian Cinque, Maria Bellman
- Be aware >> Rich Hardy said we already have an IT service based on the GPUaaS toolset in OpenShift AI called MOSAIC.
https://source.redhat.com/departments/it/ai_platforms/mosaic_platform
- In addition, coordinate with the IBM Cloud team to understand how GPU resources are managed on IBM Cloud.
- Contact: Kieran Forde
Important Principle
All participants evaluate all technologies.
The goal is collective understanding, not divided ownership or isolated expertise.
Scope
In Scope
- Deploying each candidate solution
- Hands-on experimentation using real workloads
- Evaluating practical behavior such as:
- Quotas
- Priorities
- Preemption
- Operational complexity
- Comparing observed behavior directly against the GPUaaS requirements defined in the previous steps
- Building a structured support matrix (requirement × technology)
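As one example of what the quota, priority, and preemption experiments could exercise, Kueue models team-level GPU quotas in a ClusterQueue. The sketch below assumes Kueue is installed and a ResourceFlavor named `default-flavor` exists; the queue name and quota value are illustrative, not a proposed configuration.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a            # illustrative queue name
spec:
  namespaceSelector: {}   # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8   # quota the experiments would try to exhaust
  preemption:
    withinClusterQueue: LowerPriority   # higher-priority workloads may preempt
    reclaimWithinCohort: Any
```

Submitting workloads beyond the nominal quota, then raising their priority, is a direct way to observe queueing and preemption behavior against the defined requirements.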
Out of Scope
- Defining GPUaaS architecture
- Selecting a final implementation
- Production rollout
Execution
- Use a shared test environment
- Run experiments locally or on the Model Validation GCP environment
- Deploy, configure, and actively use each solution
- Document findings based on observed behavior, not assumptions
Deliverables
- A single consolidated evaluation document containing:
- Hands-on findings per technology
- A requirements support matrix (supported / partially supported / not supported)
- Operational insights and limitations
- Clear identification of gaps and risks per solution
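A sketch of how the support matrix could be assembled mechanically from recorded findings. All requirement names, technologies, and statuses below are placeholders for illustration, not actual evaluation results.

```python
# Hypothetical helper for rendering the requirement x technology matrix
# as a Markdown table. Findings are recorded as short status codes.

STATUSES = {"S": "supported", "P": "partially supported", "N": "not supported"}

def build_matrix(findings: dict, technologies: list) -> str:
    """Render findings as a Markdown table with one row per requirement."""
    lines = [
        "| Requirement | " + " | ".join(technologies) + " |",
        "|" + "---|" * (len(technologies) + 1),
    ]
    for requirement, support in findings.items():
        cells = [STATUSES.get(support.get(tech, "N"), "not supported")
                 for tech in technologies]
        lines.append("| " + requirement + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Placeholder data, not real results:
techs = ["Run:AI", "Kueue", "DRA"]
findings = {
    "GPU quotas per team": {"Run:AI": "S", "Kueue": "S", "DRA": "N"},
    "Workload preemption": {"Run:AI": "S", "Kueue": "P", "DRA": "N"},
}
print(build_matrix(findings, techs))
```

Keeping the findings in a simple structure like this makes it easy to regenerate the matrix as experiments update the recorded statuses.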
DoD
This epic is complete when:
- All candidate technologies have been evaluated hands-on
- Each GPUaaS requirement is mapped to its real support status per technology
- A single comparison matrix document exists
Notes
This epic does not define architecture, but its output is explicitly intended to be the input and foundation for the architecture epic that follows.