AI Platform Core Components / AIPCC-10157

GPUaaS Ecosystem Deep Research

    • Type: Story
    • Priority: Critical
    • Resolution: Unresolved
    • Component: Model Validation

      Research Task – What You Are Expected To Do

      As part of this epic, each participant is required to actively read documentation and learning materials related to GPU scheduling and resource management systems, including (but not limited to):

      • Run:AI

      • Kubernetes Kueue

      • Kubernetes DRA

      • YARN-style schedulers and fairness models

      • Volcano

      • Related concepts around GPU scheduling, quotas, priorities, and preemption

       

      The expectation is not passive reading.

      You are expected to study how these systems actually work, understand their concepts, and be able to reason about their behavior.

      This task is intentionally front-loaded with theory and reading. Hands-on experimentation comes later. The goal here is to build shared mental models before touching systems.

      Note

      Why We Explicitly Study YARN and Volcano (Even Though They Are Not Kubernetes-Native)

      As part of this research, you are intentionally required to study systems such as YARN and Volcano, even though they are not native Kubernetes GPU schedulers.

      This is not accidental.

      YARN represents one of the most mature and battle-tested resource scheduling systems in large-scale distributed computing. Many of the concepts we discuss today in GPUaaS did not originate in Kubernetes. They were first explored, refined, and stress-tested in YARN and similar systems at massive scale.

       

      By studying YARN, we gain a deep understanding of:

      • Fairness vs strict priority trade-offs

      • FIFO, LIFO, weighted fairness, and DRF scheduling models

      • Preemption strategies and their failure modes

      • How ownership and quotas evolve over time

      • Why certain “simple” scheduling ideas break at scale
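To make the fairness models above concrete, here is a toy sketch of Dominant Resource Fairness (DRF), the multi-resource model referenced in this list. The numbers reuse the classic textbook example (9 CPUs, 18 GB of memory); the code is illustrative and is not how YARN or any real scheduler is implemented.

```python
# Toy DRF allocator: repeatedly grant one task to the user whose
# "dominant share" (max fraction of any single resource) is smallest.

def drf_allocate(capacity, demands, rounds):
    """capacity: per-resource totals; demands: user -> per-task demand vector."""
    shares = {u: 0.0 for u in demands}        # current dominant share per user
    used = [0.0] * len(capacity)              # consumed amount per resource
    allocations = {u: 0 for u in demands}     # tasks granted per user

    for _ in range(rounds):
        user = min(shares, key=shares.get)    # lowest dominant share goes next
        demand = demands[user]
        # Simplification: stop entirely once the chosen user no longer fits,
        # instead of trying the remaining users.
        if any(used[i] + demand[i] > capacity[i] for i in range(len(capacity))):
            break
        for i in range(len(capacity)):
            used[i] += demand[i]
        allocations[user] += 1
        shares[user] = max(
            allocations[user] * demand[i] / capacity[i]
            for i in range(len(capacity))
        )
    return allocations

# User A tasks need <1 CPU, 4 GB>; user B tasks need <3 CPU, 1 GB>.
print(drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]}, rounds=20))
# → {'A': 3, 'B': 2}: each user ends up with ~2/3 of its dominant resource.
```

Note how neither FIFO nor plain priority would produce this outcome: DRF equalizes each user's share of the resource they are hungriest for.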

       

      These concepts directly influence modern systems, even when they are implemented differently.

       

      Volcano is included because it is effectively the bridge between YARN-style batch scheduling and Kubernetes.

      It brings many of the same ideas (queues, priorities, fairness, gang scheduling) into the Kubernetes ecosystem and exposes where Kubernetes-native scheduling still struggles.
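As one concrete example of how Volcano carries a YARN-style idea into Kubernetes, below is a minimal PodGroup expressing gang scheduling. The resource name and queue name are made up; field names follow Volcano's v1beta1 API, but verify against the current docs before relying on them.

```yaml
# Illustrative only: a Volcano PodGroup. Pods of a job reference this group,
# and Volcano will not start any of them until all `minMember` pods can be
# placed at once (all-or-nothing gang scheduling).
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training      # hypothetical job name
spec:
  minMember: 4                    # gang size: schedule all 4 workers or none
  queue: research                 # hypothetical Volcano queue with its own weight/quota
  priorityClassName: high-priority
```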

       

      Studying Volcano helps answer questions such as:

      • What happens when YARN ideas are mapped onto Kubernetes primitives?

      • Where does Kubernetes make things easier, and where does it make them harder?

      • Which concepts translate cleanly, and which do not?

       

      The goal is not to adopt YARN or Volcano.

       

      The goal is to understand the ecosystem of ideas behind GPU scheduling, fairness, quotas, and preemption, so that GPUaaS decisions are made consciously, with full awareness of historical lessons and known pitfalls.

       

      Skipping this context almost guarantees repeating old mistakes.

      Purpose of This Research

      The goal of this task is to build a strong conceptual foundation around the GPUaaS ecosystem before we start designing or implementing anything.

       

      GPUaaS is not a single feature or product.

      It is an intersection of scheduling theory, fairness models, quota enforcement, ownership semantics, preemption behavior, and GPU hardware constraints.

       

      Without a deep understanding of these concepts:

      • Technology comparisons become superficial

      • Architecture decisions are based on assumptions

      • Critical edge cases are discovered too late

       

      This research ensures that when we later evaluate technologies and define architecture, we are doing so based on real understanding, not terminology or marketing claims.


      Mandatory Research Checklist

      Each participant must complete all items below.

      This checklist is part of the Definition of Done for this epic.

      Conceptual Understanding

      • I understand the difference between priority-based scheduling and fairness-based scheduling
      • I can explain FIFO, LIFO, weighted fairness, and DRF in plain English
      • I understand the trade-offs between strict priority and fair sharing
      • I understand how preemption works conceptually and why it is hard to get right
      • I understand how GPU heterogeneity impacts scheduling decisions

      System-Level Understanding

      For each evaluated technology (Run:AI, Kueue, DRA, YARN-style schedulers):

      • I understand how scheduling decisions are made
      • I understand how priorities are expressed and enforced
      • I understand whether quotas exist and how they are defined
      • I understand how preemption is triggered and executed
      • I understand what happens when resources are unavailable
      • I understand what the user sees when a workload is blocked or preempted
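As a concrete instance of how one of these systems expresses several of the points above (quota definition, preemption triggers, sharing), here is a sketch of a Kueue ClusterQueue. Names are invented; field names follow Kueue's v1beta1 API as documented upstream, and should be checked against the version actually deployed.

```yaml
# Illustrative only: a Kueue ClusterQueue granting a team 8 GPUs of a given
# flavor, sharing spare capacity within a cohort, and allowing preemption.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq                 # hypothetical queue name
spec:
  cohort: gpu-cohort              # queues in a cohort can borrow idle quota
  namespaceSelector: {}           # which namespaces may submit to this queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: a100              # a ResourceFlavor defined separately
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8     # guaranteed ownership for this team
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority workloads in-queue
    reclaimWithinCohort: Any            # reclaim quota lent to cohort peers
```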

       

      Ownership & Quotas

      • I understand how GPU ownership is represented in each system
      • I understand whether ownership is static or dynamic
      • I understand whether quotas can be:
        • GPU-type aware
        • Namespace-aware
        • Priority-aware

       


      Knowledge Validation – Questions You Must Be Able to Answer

       

      Each participant must be able to confidently answer all of the following.

       

      Scheduling & Fairness

      • What problem does fairness solve that priority does not?
      • When does fairness actively hurt critical workloads?
      • Can FIFO ever be fair? When?

       

      Priority

      • What does “priority” actually mean in each system?
      • Is priority absolute or relative?
      • Can low-priority workloads starve forever?
      • Can priority change dynamically at runtime?

      Preemption

      • What triggers preemption?
      • Is preemption deterministic or best-effort?
      • Can partial preemption exist on GPUs?
      • What are the risks of aggressive preemption?
      • How does preemption impact user trust?
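A toy model can help frame these questions before studying the real systems. The sketch below is a deliberately naive priority preemptor over a fixed GPU pool; it is not how Run:AI, Kueue, or Volcano implements preemption, and all job names and sizes are invented.

```python
# Naive priority preemption: an arriving job may evict strictly
# lower-priority running jobs to free enough GPUs.

def admit(running, job, total_gpus):
    """Each job is (name, priority, gpus). Returns (running, preempted)."""
    preempted = []
    free = total_gpus - sum(g for _, _, g in running)
    # Candidate victims: strictly lower priority, cheapest priority first.
    victims = sorted((j for j in running if j[1] < job[1]), key=lambda j: j[1])
    while free < job[2] and victims:
        victim = victims.pop(0)
        running.remove(victim)
        preempted.append(victim)
        free += victim[2]
    if free >= job[2]:
        running.append(job)
    # Failure mode worth noticing: if the job still does not fit, the
    # evictions above have already happened — work was killed for nothing.
    return running, preempted

running = [("batch-a", 1, 4), ("batch-b", 1, 2), ("serve", 5, 2)]
running, evicted = admit(running, ("urgent", 10, 4), total_gpus=8)
print(evicted)   # → [('batch-a', 1, 4)]
```

Even this toy surfaces two of the questions above: preemption here is best-effort (it can evict and still fail to place), and the victim choice is deterministic only because of an arbitrary tie-break.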

       

      GPU Semantics

      • How does the system reason about GPU types?
      • Can it distinguish between “any GPU” and “specific GPU models”?
      • How does heterogeneity affect fairness?

       

      Quotas & Ownership

      • How is GPU ownership expressed?
      • Is ownership enforced or advisory?
      • Can quotas encode priority?
      • What happens when a team exceeds its quota?
      • Can a team with zero ownership still run workloads? How, and why does that ability exist?
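The last question is answered in most systems by quota borrowing: guaranteed-but-idle capacity is lent out and can be reclaimed when the owner returns (the mechanism behind YARN's fair-scheduler sharing and Kueue's cohorts). The sketch below is a toy model of that idea; team names and numbers are made up.

```python
# Toy quota borrowing: guaranteed quota is granted first, then leftover
# capacity is lent to any remaining demand — including teams that own nothing.

def schedule(quotas, requests, total):
    """Returns {team: (guaranteed_gpus, borrowed_gpus)}."""
    grants = {}
    spare = total
    # Pass 1: every team receives up to its guaranteed quota.
    for team, want in requests.items():
        g = min(want, quotas.get(team, 0))
        grants[team] = [g, 0]
        spare -= g
    # Pass 2: unmet demand is served from spare capacity (borrowing).
    for team, want in requests.items():
        extra = min(want - grants[team][0], spare)
        grants[team][1] = extra
        spare -= extra
    return {t: tuple(v) for t, v in grants.items()}

# Team "ml" owns 6 GPUs but only asks for 2; "interns" own nothing.
print(schedule({"ml": 6, "interns": 0}, {"ml": 2, "interns": 3}, total=8))
# → {'ml': (2, 0), 'interns': (0, 3)}: the zero-ownership team runs on loan.
```

The borrowed capacity is exactly what preemption later reclaims, which is why borrowing and preemption policies must be designed together.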

       Output Expectation

      The output of this epic must prove that:

      • The ecosystem is understood, not guessed
      • Assumptions are surfaced early

      If someone cannot answer the questions above, the epic is not done, regardless of documents produced.

              Reporter / Assignee: Aviran Badli (rh-ee-abadli)