AI Platform Core Components / AIPCC-9830

GPUaaS Technology Research & Hands-On Evaluation

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Critical
    • Model Validation
    • In Progress
    • AIPCC-9826 GPU as a Non-Issue – Inventory, Research, and Incremental GPUaaS Implementation (AIPCC Only)
    • 50% To Do, 19% In Progress, 31% Done

       

      The requirements document linked below defines the functional and operational requirements for the GPUaaS initiative.

      Its purpose is to help guide what to research, what to read, and what to validate when evaluating different technologies.

      The requirements serve as a clear set of principles and expectations that any candidate solution should be measured against.

      https://docs.google.com/document/d/1eGLwzmlK115DGoaxTlHhR1jwLjQ1f6oxbegWI2c6Bfg/edit?tab=t.0


      Context

      Following the GPUaaS requirements definition, we need to validate how existing technologies actually behave in practice.

       

      This epic bridges the gap between theoretical requirements and real-world capabilities by combining hands-on experimentation with structured analysis.

       

      The outcome of this epic is a single, consolidated document that maps requirements to real system behavior.

       


      Objective

      Produce one consolidated evaluation document that combines:

      • The GPUaaS requirements defined in the previous epic
      • Hands-on experimentation results
      • A clear matrix showing which technology supports which requirement, and how

      This document will serve as the factual baseline for the GPUaaS architecture definition in the next phase.

       


      Technologies to Evaluate

      • Run:AI
      • Kubernetes Kueue
      • Kubernetes DRA
      • Additional relevant technologies identified during research

       

      Additional Scope – Research

      • As part of this research task, also investigate the internal GPU resource management solution used in RDU4 (Red Hat’s internal data center in Raleigh/Durham). This should include how GPUs are managed, allocated, and monitored today.
      • In addition, coordinate with the IBM Cloud team to understand how GPU resources are managed on IBM Cloud.
        • Contact: Kieran Forde

       


      Important Principle

      All participants evaluate all technologies.

      The goal is collective understanding, not divided ownership or isolated expertise.

       


      Scope

      In Scope

      • Deploying each candidate solution
      • Hands-on experimentation using real workloads
      • Evaluating practical behavior such as:
        • Quotas
        • Priorities
        • Preemption
        • Operational complexity
      • Comparing observed behavior directly against the GPUaaS requirements defined in the previous epic
      • Building a structured support matrix (requirement × technology)

       

      Out of Scope

      • Defining GPUaaS architecture
      • Selecting a final implementation
      • Production rollout

       


      Execution

      • Use a shared test environment
      • Run experiments locally or on the Model Validation GCP environment
      • Deploy, configure, and actively use each solution
      • Document findings based on observed behavior, not assumptions

      Deliverables

      • A single consolidated evaluation document containing:
        • Hands-on findings per technology
        • A requirements support matrix (supported / partially supported / not supported)
        • Operational insights and limitations
      • Clear identification of gaps and risks per solution
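
One possible way to keep the requirements support matrix in a machine-readable form is sketched below. It is only an illustration: the requirement IDs (`req-quotas`, `req-preemption`), the technology names (`tech-a`, `tech-b`), and every status value are hypothetical placeholders, not evaluation results. The three status values mirror the supported / partially supported / not supported scale listed above.

```python
from enum import Enum


class Support(Enum):
    """The three support levels used in the requirements support matrix."""
    SUPPORTED = "supported"
    PARTIAL = "partially supported"
    NOT_SUPPORTED = "not supported"


# Hypothetical placeholder data: requirement -> technology -> observed status.
# These names and statuses are illustrative only, NOT evaluation findings.
matrix = {
    "req-quotas": {"tech-a": Support.SUPPORTED, "tech-b": Support.PARTIAL},
    "req-preemption": {"tech-a": Support.NOT_SUPPORTED, "tech-b": Support.SUPPORTED},
}


def render(matrix):
    """Render the matrix as a plain-text table (requirement x technology)."""
    techs = sorted({tech for row in matrix.values() for tech in row})
    header = ["Requirement"] + techs
    rows = [
        [req] + [row[t].value if t in row else "not evaluated" for t in techs]
        for req, row in matrix.items()
    ]
    table = [header] + rows
    # Column width is the longest cell in that column.
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    )


def gaps(matrix):
    """List every (requirement, technology, status) that is not fully supported."""
    return [
        (req, tech, status.value)
        for req, row in matrix.items()
        for tech, status in row.items()
        if status is not Support.SUPPORTED
    ]


if __name__ == "__main__":
    print(render(matrix))
    for req, tech, status in gaps(matrix):
        print(f"gap: {req} on {tech} is {status}")
```

Keeping the matrix as data rather than only as a formatted table would let the "gaps and risks per solution" list be derived from the same source as the matrix itself, so the two cannot drift apart.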

       


      Definition of Done (DoD)

      This epic is complete when:

      • All candidate technologies have been evaluated hands-on
      • Each GPUaaS requirement is mapped to its observed support status for each technology
      • A single comparison matrix document exists
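
The mapping criterion above can be checked mechanically. The following is a minimal sketch, assuming the matrix is stored as a nested dictionary of plain strings. All requirement and technology names here are hypothetical placeholders, and the statuses are not real findings.

```python
# Allowed status values, matching the scale used in the support matrix.
ALLOWED = {"supported", "partially supported", "not supported"}


def missing_cells(requirements, technologies, matrix):
    """Return (requirement, technology) pairs with no valid status recorded."""
    return [
        (req, tech)
        for req in requirements
        for tech in technologies
        if matrix.get(req, {}).get(tech) not in ALLOWED
    ]


def is_complete(requirements, technologies, matrix):
    """True when every requirement has a valid status for every technology."""
    return not missing_cells(requirements, technologies, matrix)


# Hypothetical placeholder data, for illustration only.
requirements = ["req-1", "req-2"]
technologies = ["tech-a", "tech-b"]
matrix = {
    "req-1": {"tech-a": "supported", "tech-b": "not supported"},
    "req-2": {"tech-a": "partially supported"},  # tech-b not yet evaluated
}

if __name__ == "__main__":
    print(is_complete(requirements, technologies, matrix))
    print(missing_cells(requirements, technologies, matrix))
```

A check like this only confirms that every cell is filled in. Whether each status reflects observed behavior rather than assumptions still has to be verified by the evaluators.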

      Notes

      This epic does not define architecture, but its output is explicitly intended to be the input and foundation for the architecture epic that follows.

              rh-ee-abadli Aviran Badli