AI Platform Core Components / AIPCC-10157

GPUaaS Ecosystem Deep Research

    • Type: Story
    • Priority: Critical
    • Resolution: Unresolved
    • Component: Model Validation

      Research Task – What You Are Expected To Do

      As part of this epic, each participant is required to actively read documentation and learning materials related to GPU scheduling and resource management systems, including (but not limited to):

      • Run:AI

      • Kubernetes Kueue

      • Kubernetes DRA

      • YARN-style schedulers and fairness models

      • Volcano

      • Related concepts around GPU scheduling, quotas, priorities, and preemption

       

      The expectation is not passive reading.

      You are expected to study how these systems actually work, understand their concepts, and be able to reason about their behavior.

      This task is intentionally front-loaded with theory and reading. Hands-on experimentation comes later. The goal here is to build shared mental models before touching systems.

      Note

      Why We Explicitly Study YARN and Volcano (Even Though They Are Not Kubernetes-Native)

      As part of this research, you are intentionally required to study systems such as YARN and Volcano, even though they are not native Kubernetes GPU schedulers.

      This is not accidental.

      YARN represents one of the most mature and battle-tested resource scheduling systems in large-scale distributed computing. Many of the concepts we discuss today in GPUaaS did not originate in Kubernetes. They were first explored, refined, and stress-tested in YARN and similar systems at massive scale.

       

      By studying YARN, we gain a deep understanding of:

      • Fairness vs strict priority trade-offs

      • FIFO, LIFO, weighted fairness, and DRF scheduling models

      • Preemption strategies and their failure modes

      • How ownership and quotas evolve over time

      • Why certain “simple” scheduling ideas break at scale
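To make the fairness models above concrete, here is a toy sketch of Dominant Resource Fairness (DRF), the multi-resource model referenced in this list. The numbers reuse the classic textbook example (9 CPUs, 18 GB of memory); the code is illustrative and is not how YARN or any real scheduler is implemented.

```python
# Toy DRF allocator: repeatedly grant one task to the user whose
# "dominant share" (max fraction of any single resource) is smallest.

def drf_allocate(capacity, demands, rounds):
    """capacity: per-resource totals; demands: user -> per-task demand vector."""
    shares = {u: 0.0 for u in demands}        # current dominant share per user
    used = [0.0] * len(capacity)              # consumed amount per resource
    allocations = {u: 0 for u in demands}     # tasks granted per user

    for _ in range(rounds):
        user = min(shares, key=shares.get)    # lowest dominant share goes next
        demand = demands[user]
        # Simplification: stop entirely once the chosen user no longer fits,
        # instead of trying the remaining users.
        if any(used[i] + demand[i] > capacity[i] for i in range(len(capacity))):
            break
        for i in range(len(capacity)):
            used[i] += demand[i]
        allocations[user] += 1
        shares[user] = max(
            allocations[user] * demand[i] / capacity[i]
            for i in range(len(capacity))
        )
    return allocations

# User A tasks need <1 CPU, 4 GB>; user B tasks need <3 CPU, 1 GB>.
print(drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]}, rounds=20))
# → {'A': 3, 'B': 2}: each user ends up with ~2/3 of its dominant resource.
```

Note how neither FIFO nor plain priority would produce this outcome: DRF equalizes each user's share of the resource they are hungriest for.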

       

      These concepts directly influence modern systems, even when they are implemented differently.

       

      Volcano is included because it is effectively the bridge between YARN-style batch scheduling and Kubernetes.

      It brings many of the same ideas (queues, priorities, fairness, gang scheduling) into the Kubernetes ecosystem and exposes where Kubernetes-native scheduling still struggles.
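As one concrete example of how Volcano carries a YARN-style idea into Kubernetes, below is a minimal PodGroup expressing gang scheduling. The resource name and queue name are made up; field names follow Volcano's v1beta1 API, but verify against the current docs before relying on them.

```yaml
# Illustrative only: a Volcano PodGroup. Pods of a job reference this group,
# and Volcano will not start any of them until all `minMember` pods can be
# placed at once (all-or-nothing gang scheduling).
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training      # hypothetical job name
spec:
  minMember: 4                    # gang size: schedule all 4 workers or none
  queue: research                 # hypothetical Volcano queue with its own weight/quota
  priorityClassName: high-priority
```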

       

      Studying Volcano helps answer questions such as:

      • What happens when YARN ideas are mapped onto Kubernetes primitives?

      • Where does Kubernetes make things easier, and where does it make them harder?

      • Which concepts translate cleanly, and which do not?

       

      The goal is not to adopt YARN or Volcano.

       

      The goal is to understand the ecosystem of ideas behind GPU scheduling, fairness, quotas, and preemption, so that GPUaaS decisions are made consciously, with full awareness of historical lessons and known pitfalls.

       

      Skipping this context almost guarantees repeating old mistakes.

      Purpose of This Research

      The goal of this task is to build a strong conceptual foundation around the GPUaaS ecosystem before we start designing or implementing anything.

       

      GPUaaS is not a single feature or product.

      It is an intersection of scheduling theory, fairness models, quota enforcement, ownership semantics, preemption behavior, and GPU hardware constraints.

       

      Without a deep understanding of these concepts:

      • Technology comparisons become superficial

      • Architecture decisions are based on assumptions

      • Critical edge cases are discovered too late

       

      This research ensures that when we later evaluate technologies and define architecture, we are doing so based on real understanding, not terminology or marketing claims.


      Mandatory Research Checklist

      Each participant must complete all items below.

      This checklist is part of the Definition of Done for this epic.

      Conceptual Understanding

      • I understand the difference between priority-based scheduling and fairness-based scheduling
      • I can explain FIFO, LIFO, weighted fairness, and DRF in plain English
      • I understand the trade-offs between strict priority and fair sharing
      • I understand how preemption works conceptually and why it is hard to get right
      • I understand how GPU heterogeneity impacts scheduling decisions

      System-Level Understanding

      For each evaluated technology (Run:AI, Kueue, DRA, YARN-style schedulers):

      • I understand how scheduling decisions are made
      • I understand how priorities are expressed and enforced
      • I understand whether quotas exist and how they are defined
      • I understand how preemption is triggered and executed
      • I understand what happens when resources are unavailable
      • I understand what the user sees when a workload is blocked or preempted
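As a concrete instance of how one of these systems expresses several of the points above (quota definition, preemption triggers, sharing), here is a sketch of a Kueue ClusterQueue. Names are invented; field names follow Kueue's v1beta1 API as documented upstream, and should be checked against the version actually deployed.

```yaml
# Illustrative only: a Kueue ClusterQueue granting a team 8 GPUs of a given
# flavor, sharing spare capacity within a cohort, and allowing preemption.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq                 # hypothetical queue name
spec:
  cohort: gpu-cohort              # queues in a cohort can borrow idle quota
  namespaceSelector: {}           # which namespaces may submit to this queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: a100              # a ResourceFlavor defined separately
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8     # guaranteed ownership for this team
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority workloads in-queue
    reclaimWithinCohort: Any            # reclaim quota lent to cohort peers
```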

       

      Ownership & Quotas

      • I understand how GPU ownership is represented in each system
      • I understand whether ownership is static or dynamic
      • I understand whether quotas can be:
        • GPU-type aware
        • Namespace-aware
        • Priority-aware

       


      Knowledge Validation – Questions You Must Be Able to Answer

       

      Each participant must be able to confidently answer all of the following.

       

      Scheduling & Fairness

      • What problem does fairness solve that priority does not?
      • When does fairness actively hurt critical workloads?
      • Can FIFO ever be fair? When?

       

      Priority

      • What does “priority” actually mean in each system?
      • Is priority absolute or relative?
      • Can low-priority workloads starve forever?
      • Can priority change dynamically at runtime?

      Preemption

      • What triggers preemption?
      • Is preemption deterministic or best-effort?
      • Can partial preemption exist on GPUs?
      • What are the risks of aggressive preemption?
      • How does preemption impact user trust?
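A toy model can help frame these questions before studying the real systems. The sketch below is a deliberately naive priority preemptor over a fixed GPU pool; it is not how Run:AI, Kueue, or Volcano implements preemption, and all job names and sizes are invented.

```python
# Naive priority preemption: an arriving job may evict strictly
# lower-priority running jobs to free enough GPUs.

def admit(running, job, total_gpus):
    """Each job is (name, priority, gpus). Returns (running, preempted)."""
    preempted = []
    free = total_gpus - sum(g for _, _, g in running)
    # Candidate victims: strictly lower priority, cheapest priority first.
    victims = sorted((j for j in running if j[1] < job[1]), key=lambda j: j[1])
    while free < job[2] and victims:
        victim = victims.pop(0)
        running.remove(victim)
        preempted.append(victim)
        free += victim[2]
    if free >= job[2]:
        running.append(job)
    # Failure mode worth noticing: if the job still does not fit, the
    # evictions above have already happened — work was killed for nothing.
    return running, preempted

running = [("batch-a", 1, 4), ("batch-b", 1, 2), ("serve", 5, 2)]
running, evicted = admit(running, ("urgent", 10, 4), total_gpus=8)
print(evicted)   # → [('batch-a', 1, 4)]
```

Even this toy surfaces two of the questions above: preemption here is best-effort (it can evict and still fail to place), and the victim choice is deterministic only because of an arbitrary tie-break.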

       

      GPU Semantics

      • How does the system reason about GPU types?
      • Can it distinguish between “any GPU” and “specific GPU models”?
      • How does heterogeneity affect fairness?

       

      Quotas & Ownership

      • How is GPU ownership expressed?
      • Is ownership enforced or advisory?
      • Can quotas encode priority?
      • What happens when a team exceeds its quota?
      • Can a team with zero ownership still run workloads? How, and why does that ability exist?
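The last question is answered in most systems by quota borrowing: guaranteed-but-idle capacity is lent out and can be reclaimed when the owner returns (the mechanism behind YARN's fair-scheduler sharing and Kueue's cohorts). The sketch below is a toy model of that idea; team names and numbers are made up.

```python
# Toy quota borrowing: guaranteed quota is granted first, then leftover
# capacity is lent to any remaining demand — including teams that own nothing.

def schedule(quotas, requests, total):
    """Returns {team: (guaranteed_gpus, borrowed_gpus)}."""
    grants = {}
    spare = total
    # Pass 1: every team receives up to its guaranteed quota.
    for team, want in requests.items():
        g = min(want, quotas.get(team, 0))
        grants[team] = [g, 0]
        spare -= g
    # Pass 2: unmet demand is served from spare capacity (borrowing).
    for team, want in requests.items():
        extra = min(want - grants[team][0], spare)
        grants[team][1] = extra
        spare -= extra
    return {t: tuple(v) for t, v in grants.items()}

# Team "ml" owns 6 GPUs but only asks for 2; "interns" own nothing.
print(schedule({"ml": 6, "interns": 0}, {"ml": 2, "interns": 3}, total=8))
# → {'ml': (2, 0), 'interns': (0, 3)}: the zero-ownership team runs on loan.
```

The borrowed capacity is exactly what preemption later reclaims, which is why borrowing and preemption policies must be designed together.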

       Output Expectation

      The output of this epic must prove that:

      • The ecosystem is understood, not guessed
      • Assumptions are surfaced early

      If someone cannot answer the questions above, the epic is not done, regardless of documents produced.

              Reporter / Assignee: Aviran Badli (rh-ee-abadli)