AIPCC-9826: GPU as a Non-Issue – Inventory, Research, and Incremental GPUaaS Implementation (AIPCC Only)
Type: Epic | Priority: Major | Status: To Do | Resolution: Unresolved
Progress: 0% To Do, 100% In Progress, 0% Done
Context
Across AIPCC, there is currently no single, authoritative, and continuously updated view of GPU resources.
GPU information is fragmented across spreadsheets, cloud consoles, Kubernetes clusters, billing systems, and personal knowledge.
As a result, leadership and teams cannot reliably answer basic questions such as:
Which GPUs exist?
Who owns them?
Where do they run?
How much are they actually used?
How much do they cost?
Before advancing GPU sharing, quotas, or scheduling, we must first establish automated, trusted visibility into GPU ownership and usage.
Objective
Design and implement an automated solution that continuously collects, normalizes, and exposes GPU inventory, usage, and cost data across teams and environments, and presents it in a single, unified dashboard.
This epic creates a near real-time source of truth that enables analysis, decision-making, and organizational alignment.
Important Note
This epic is not GPUaaS.
We recognize that identifying and orchestrating GPU usage across all clusters, environments, and cloud accounts will take time.
However, automated GPU inventory and visibility is a foundational and independent step that delivers immediate value.
This work enables transparency and planning even without quotas, priorities, or preemption.
It is a prerequisite for GPUaaS, but does not depend on it.
Scope
In Scope
• Automated discovery of GPUs across environments
• Automated collection of usage and utilization metrics
• Mapping GPUs to teams, ownership, environments, and clusters
• Cost attribution at a GPU or pool level (best-effort)
• Near real-time updates (maximum 1-hour freshness)
• Centralized data model
• A single, unified dashboard for analysis
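The centralized data model could start as a small record per discovered GPU. A minimal sketch follows; every field name here is an illustrative assumption, not a settled schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class GpuRecord:
    """One row in the centralized inventory; all field names are illustrative."""
    gpu_uuid: str                     # stable hardware identifier (e.g. from NVML/DCGM)
    model: str                        # e.g. "A100-SXM4-80GB"
    cluster: str                      # cluster or cloud account the GPU lives in
    environment: str                  # prod / dev / poc
    owning_team: Optional[str]        # None until ownership mapping is complete
    utilization_pct: float            # latest sampled utilization, 0-100
    hourly_cost_usd: Optional[float]  # best-effort cost attribution
    last_seen: datetime               # when this record was last refreshed
```

Keeping ownership and cost optional reflects the best-effort scope above: discovery can populate rows before the mapping and billing integrations exist.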
Out of Scope
• GPU scheduling
• Quota enforcement
• Priority handling
• Preemption
• Workload placement logic
Architecture & Cross-Team Adoption Requirement
The architecture must be acceptable to other teams across AIPCC.
This is a hard requirement.
The design must assume that:
• Some teams may need to push data into the system
• Other teams may require a lightweight controller / agent / integration running in their environment
Therefore, the architecture must:
• Minimize operational burden
• Avoid intrusive or risky components
• Be clearly documented and easy to adopt
• Respect team ownership and autonomy
A solution that is technically sound but not adopted by teams is considered a failure.
Key Capabilities
The system must enable:
• Identifying which GPUs exist and where they run
• Mapping GPUs to owning teams and environments
• Distinguishing used vs idle GPUs
• Tracking usage over time (patterns, not just snapshots)
• Associating GPU capacity with estimated cost
• Highlighting underutilization and inefficiencies
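Distinguishing used vs idle GPUs could be as simple as a threshold over a window of utilization samples. A sketch, assuming a 5% threshold (the threshold and window are tuning assumptions, not requirements from this epic):

```python
def is_idle(samples: list[float], threshold_pct: float = 5.0) -> bool:
    """A GPU counts as idle if every utilization sample in the window
    stays below the threshold. The 5% default is an assumed value."""
    return bool(samples) and all(s < threshold_pct for s in samples)
```

Classifying over a window rather than a single snapshot matches the requirement above to track patterns, not just point-in-time readings.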
Dashboard Requirements
A single unified dashboard must exist and support:
• Filtering by team
• Filtering by environment (prod, dev, poc, etc.)
• Filtering by cloud or cluster
• Understanding who used which GPUs and when
• Identifying usage patterns and trends over time
The dashboard must support real decision-making and be usable by:
• Engineering managers
• Org leadership (for example Tom)
• Platform and infra stakeholders
Execution Approach
This epic includes both design and implementation.
Work includes:
• Identifying and validating data sources (Kubernetes, cloud APIs, DCGM, billing, labels, etc.)
• Defining a unified GPU data model
• Building automated data ingestion pipelines
• Normalizing and correlating data across sources
• Implementing the initial dashboard
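The normalization and correlation step above could join per-GPU rows from different sources on a stable hardware identifier. A minimal sketch, assuming Kubernetes supplies placement and a billing export supplies cost, keyed on a `gpu_uuid` field (all field names are assumptions):

```python
def correlate(k8s_rows: list[dict], billing_rows: list[dict]) -> dict:
    """Join per-GPU rows from two sources on gpu_uuid.
    Kubernetes rows contribute placement fields; billing rows
    contribute best-effort cost. Returns gpu_uuid -> merged record."""
    merged: dict[str, dict] = {}
    for row in k8s_rows:
        merged[row["gpu_uuid"]] = dict(row)
    for row in billing_rows:
        # A GPU may appear only in billing (e.g. not yet schedulable).
        merged.setdefault(row["gpu_uuid"], {"gpu_uuid": row["gpu_uuid"]})
        merged[row["gpu_uuid"]]["hourly_cost_usd"] = row.get("hourly_cost_usd")
    return merged
```

Real pipelines would add more sources (DCGM, cloud APIs, labels) and conflict rules, but the join-on-UUID shape stays the same.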
Data Freshness Requirement
Data must be continuously updated, with a maximum delay of one hour.
Manual or ad-hoc updates are not acceptable.
Deliverables
• Implemented automated GPU inventory and usage collection
• Unified data model for GPUs, ownership, and usage
• Live dashboard with filtering and drill-down
• Documentation of architecture and integration options
• Clear list of known gaps, assumptions, and risks
Definition of Done (DoD)
This epic is complete when:
• GPU inventory and usage are collected automatically
• Data freshness is within one hour
• A single dashboard is live and accessible
• Managers can see which GPUs they own, how much they cost, and how they are used
• Usage patterns and inefficiencies can be clearly identified
• The architecture is reviewed and accepted as adoptable by other teams
• Known limitations are explicitly documented