
Type: Story
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Component/s: Model Validation
Labels:
- GPUaaS

Blocked: False
Ready: False
Epic Link: Automated GPU Inventory & Live Visibility

Context

Across AIPCC, GPU data exists in many clusters and environments, but there is no standardized, automated way to collect it into a single, trusted system.

Before thinking about dashboards or user experience, we should define how GPU data flows from nodes and clusters into a centralized data store in a reliable, near real-time way.

This story focuses purely on data collection architecture, operational enablement, and adoption readiness.

Objective

Define an initial, scalable architecture for continuously collecting GPU inventory and usage data from many Kubernetes clusters into a centralized data store.

This includes not only the technical design, but also clear documentation and "onboarding requirements" that enable other teams to participate.

Key Design Questions

This story must explicitly answer:

What data is collected at the node level vs cluster level?

What components must run on each node (if any)?

What components must run at the cluster level?

Do we pull data centrally, or do clusters push data?

What is the collection frequency and latency?

How is data normalized across clusters?

How is data validated and monitored?

How do we detect broken collectors or stale data?
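
As a starting point for the normalization question, here is a minimal sketch of mapping per-cluster samples onto one shared record shape. Everything in it is an assumption for illustration: the field names, the `MODEL_ALIASES` table, and the `utilization_unit` hint are hypothetical and would be replaced by whatever the architecture document settles on.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical alias table: different sources may spell the same GPU model differently.
MODEL_ALIASES = {
    "NVIDIA A100-SXM4-80GB": "a100-80gb",
    "A100-SXM4-80GB": "a100-80gb",
}


@dataclass(frozen=True)
class GpuRecord:
    """One normalized GPU sample (hypothetical schema)."""
    cluster: str
    node: str
    gpu_uuid: str
    model: str
    utilization_pct: float
    collected_at: datetime  # always stored in UTC


def normalize(raw: dict, cluster: str) -> GpuRecord:
    """Convert one raw sample (assumed dict shape) into a GpuRecord."""
    model = raw["model"].strip()
    utilization = float(raw["utilization"])
    # Assumption: a source reports either a 0..1 ratio or a 0..100 percentage.
    if raw.get("utilization_unit", "percent") == "ratio":
        utilization *= 100.0
    return GpuRecord(
        cluster=cluster,
        node=raw["node"],
        gpu_uuid=raw["uuid"],
        model=MODEL_ALIASES.get(model, model.lower()),
        utilization_pct=round(utilization, 2),
        collected_at=datetime.fromtimestamp(raw["timestamp"], tz=timezone.utc),
    )
```

Keeping the cluster name, a canonical model name, a single utilization unit, and UTC timestamps on every record is one way to make samples from different clusters directly comparable in the central store.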

Scope

In scope:

Identify data sources (DCGM, kubelet, Kubernetes API, cloud APIs, billing, labels)

Define node-level and cluster-level collectors

Define push vs pull ingestion models

Define centralized data storage (metrics / time-series / relational)

Define update frequency and freshness guarantees (≤ 1 hour)

Define health checks and data validation

Define operational requirements for teams

Out of scope:

Dashboard UX or visualization

GPU scheduling, quotas, priorities, or preemption

GPUaaS behavior or policy enforcement

Architecture Requirements

The architecture must:

Support multiple clusters and environments

Be cloud-agnostic

Minimize operational and security burden on teams

Avoid intrusive or high-risk components

Support incremental onboarding

Clearly separate data collection from data consumption

Team Enablement & Adoption

This is a hard requirement.

As part of this story, we must produce clear guidance for other teams, including:

What must be installed on clusters

What must be installed on nodes (if applicable)

Required permissions and access

Required labels, annotations, or metadata

Expected data exposure format

Ownership and support model

Any other prerequisites identified during the design

A technically correct architecture without clear adoption guidance is considered incomplete.

Documentation Requirement

This story must deliver a dedicated architecture document that includes:

Architecture overview and data flow

Component responsibilities

Deployment models

Security and network considerations

Onboarding steps for new clusters

This document is the primary artifact used to engage other teams.

Architecture Outputs – Mandatory Adoption Contract

This story must produce not only a target architecture, but also a mandatory adoption contract derived from that architecture.

The adoption contract explicitly defines what other teams must install, configure, or expose in their clusters or nodes in order to participate.

This includes:

Mandatory node-level components (for example: DCGM, exporters, agents)

Mandatory cluster-level components (operators, CRDs, APIs)

Required labels, annotations, or metadata

Required data exposure mechanisms (push vs pull)

Required data freshness and health validation checks

These requirements are non-negotiable technical prerequisites derived from the architecture.

If a team cannot meet these requirements, their GPUs will not be visible in the system.
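
The contract could eventually be expressed as a machine-checkable list, so a team can see exactly why a cluster is not yet visible. The sketch below is only an illustration: the label keys, the component name, and the allowed exposure mode are placeholder assumptions, not decisions made by this story.

```python
from dataclasses import dataclass, field

# Placeholder contract values; the real ones are an output of the architecture work.
REQUIRED_LABELS = {"gpu-inventory/team", "gpu-inventory/environment"}
REQUIRED_COMPONENTS = {"dcgm-exporter"}
ALLOWED_EXPOSURE = {"push"}  # push vs pull is still an open design question


@dataclass
class ClusterReport:
    """What a cluster declares about itself (hypothetical shape)."""
    name: str
    labels: dict = field(default_factory=dict)
    components: set = field(default_factory=set)
    exposure: str = ""


def contract_violations(report: ClusterReport) -> list:
    """Return human-readable reasons why a cluster does not meet the contract."""
    problems = []
    for label in sorted(REQUIRED_LABELS - set(report.labels)):
        problems.append(f"missing label: {label}")
    for component in sorted(REQUIRED_COMPONENTS - set(report.components)):
        problems.append(f"missing component: {component}")
    if report.exposure not in ALLOWED_EXPOSURE:
        problems.append(f"unsupported exposure mode: {report.exposure or 'none'}")
    return problems
```

An empty result means the cluster satisfies the contract; any other result doubles as onboarding feedback for the owning team.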

Data Integrity & Validation

The design must include:

Freshness signals and heartbeats per cluster

Detection of missing, partial, or stale data

Sanity checks (GPU count mismatches, zero-usage anomalies)
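
The checks above can be sketched as a single per-cluster function. This is a minimal illustration under stated assumptions: the function name, its parameters, and the finding strings are hypothetical, the one-hour limit is taken from the freshness target in Scope, and `expected_gpus` is assumed to come from an independent source such as the Kubernetes API.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_LIMIT = timedelta(hours=1)  # the "≤ 1 hour" target from Scope


def check_cluster(last_heartbeat, expected_gpus, reported_gpus, utilizations, now=None):
    """Return a list of findings for one cluster; an empty list means healthy."""
    now = now or datetime.now(timezone.utc)
    findings = []

    # Freshness: no heartbeat within the limit means the data is stale.
    if now - last_heartbeat > FRESHNESS_LIMIT:
        findings.append("stale")

    # Completeness: compare reported GPUs with the independently known count.
    if expected_gpus > 0 and reported_gpus == 0:
        findings.append("missing")
    elif reported_gpus < expected_gpus:
        findings.append("partial")
    elif reported_gpus > expected_gpus:
        findings.append("count_mismatch")

    # Zero-usage anomaly: every reported GPU shows exactly 0% utilization.
    if utilizations and all(u == 0 for u in utilizations):
        findings.append("zero_usage")

    return findings
```

A zero-usage finding is only a prompt for investigation, since an idle cluster and a broken exporter can look identical in the data.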

Deliverables

Architecture diagram focused on data flow

Written architecture and onboarding document

Defined installation requirements for clusters and nodes

Defined ingestion and validation mechanisms

List of known gaps, assumptions, and risks

Definition of Done (DoD)

This story is complete when:

A clear, end-to-end data collection architecture exists

Required components per node and per cluster are documented

Data ingestion and validation paths are defined

A written architecture and onboarding document exists

The design is reviewed by other team members

Collaboration & Ongoing Alignment

As part of this story, the work will be done in close collaboration with our core working group.

Throughout the execution of this story:

Intermediate designs and decisions will be shared with the core team on an ongoing basis

Key milestones and architectural steps will be presented as they evolve, not only at the end

Feedback will be actively incorporated to reduce blind spots and course-correct early

The goal is to ensure:

Shared understanding of the architecture as it forms

Early validation of assumptions

Strong alignment before engaging broader teams across the organization

This collaboration is considered part of the execution, not an optional review step.
