AIPCC-9826: GPU as a Non-Issue – Inventory, Research, and Incremental GPUaaS Implementation (AIPCC Only)
Type: Epic | Priority: Major | Status: To Do | Resolution: Unresolved
Progress: 0% To Do, 100% In Progress, 0% Done
Context
Across AIPCC, there is currently no single, authoritative, and continuously updated view of GPU resources.
GPU information is fragmented across spreadsheets, cloud consoles, Kubernetes clusters, billing systems, and personal knowledge.
As a result, leadership and teams cannot reliably answer basic questions such as:
Which GPUs exist?
Who owns them?
Where do they run?
How much are they actually used?
How much do they cost?
Before advancing GPU sharing, quotas, or scheduling, we must first establish automated, trusted visibility into GPU ownership and usage.
Objective
Design and implement an automated solution that continuously collects, normalizes, and exposes GPU inventory, usage, and cost data across teams and environments, and presents it in a single, unified dashboard.
This epic creates a near real-time source of truth that enables analysis, decision-making, and organizational alignment.
Important Note
This epic is not GPUaaS.
We recognize that identifying and orchestrating GPU usage across all clusters, environments, and cloud accounts will take time.
However, automated GPU inventory and visibility is a foundational and independent step that delivers immediate value.
This work enables transparency and planning even without quotas, priorities, or preemption.
It is a prerequisite for GPUaaS, but does not depend on it.
Scope
In Scope
• Automated discovery of GPUs across environments
• Automated collection of usage and utilization metrics
• Mapping GPUs to teams, ownership, environments, and clusters
• Cost attribution at a GPU or pool level (best-effort)
• Near real-time updates (maximum 1-hour freshness)
• Centralized data model
• A single, unified dashboard for analysis
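The centralized data model could start as a small record per discovered GPU. A minimal sketch follows; every field name here is an illustrative assumption, not a settled schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class GpuRecord:
    """One row in the centralized inventory; all field names are illustrative."""
    gpu_uuid: str                     # stable hardware identifier (e.g. from NVML/DCGM)
    model: str                        # e.g. "A100-SXM4-80GB"
    cluster: str                      # cluster or cloud account the GPU lives in
    environment: str                  # prod / dev / poc
    owning_team: Optional[str]        # None until ownership mapping is complete
    utilization_pct: float            # latest sampled utilization, 0-100
    hourly_cost_usd: Optional[float]  # best-effort cost attribution
    last_seen: datetime               # when this record was last refreshed
```

Keeping ownership and cost optional reflects the best-effort scope above: discovery can populate rows before the mapping and billing integrations exist.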
Out of Scope
• GPU scheduling
• Quota enforcement
• Priority handling
• Preemption
• Workload placement logic
Architecture & Cross-Team Adoption Requirement
The architecture must be acceptable to other teams across AIPCC.
This is a hard requirement.
The design must assume that:
• Some teams may need to push data into the system
• Other teams may require a lightweight controller / agent / integration running in their environment
Therefore, the architecture must:
• Minimize operational burden
• Avoid intrusive or risky components
• Be clearly documented and easy to adopt
• Respect team ownership and autonomy
A solution that is technically sound but not adopted by teams is considered a failure.
Key Capabilities
The system must enable:
• Identifying which GPUs exist and where they run
• Mapping GPUs to owning teams and environments
• Distinguishing used vs idle GPUs
• Tracking usage over time (patterns, not just snapshots)
• Associating GPU capacity with estimated cost
• Highlighting underutilization and inefficiencies
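Distinguishing used vs idle GPUs could be as simple as a threshold over a window of utilization samples. A sketch, assuming a 5% threshold (the threshold and window are tuning assumptions, not requirements from this epic):

```python
def is_idle(samples: list[float], threshold_pct: float = 5.0) -> bool:
    """A GPU counts as idle if every utilization sample in the window
    stays below the threshold. The 5% default is an assumed value."""
    return bool(samples) and all(s < threshold_pct for s in samples)
```

Classifying over a window rather than a single snapshot matches the requirement above to track patterns, not just point-in-time readings.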
Dashboard Requirements
A single unified dashboard must exist and support:
• Filtering by team
• Filtering by environment (prod, dev, poc, etc.)
• Filtering by cloud or cluster
• Understanding who used which GPUs and when
• Identifying usage patterns and trends over time
The dashboard must support real decision-making and be usable by:
• Engineering managers
• Org leadership (for example Tom)
• Platform and infra stakeholders
Execution Approach
This epic includes both design and implementation.
Work includes:
• Identifying and validating data sources (Kubernetes, cloud APIs, DCGM, billing, labels, etc.)
• Defining a unified GPU data model
• Building automated data ingestion pipelines
• Normalizing and correlating data across sources
• Implementing the initial dashboard
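The normalization and correlation step above could join per-GPU rows from different sources on a stable hardware identifier. A minimal sketch, assuming Kubernetes supplies placement and a billing export supplies cost, keyed on a `gpu_uuid` field (all field names are assumptions):

```python
def correlate(k8s_rows: list[dict], billing_rows: list[dict]) -> dict:
    """Join per-GPU rows from two sources on gpu_uuid.
    Kubernetes rows contribute placement fields; billing rows
    contribute best-effort cost. Returns gpu_uuid -> merged record."""
    merged: dict[str, dict] = {}
    for row in k8s_rows:
        merged[row["gpu_uuid"]] = dict(row)
    for row in billing_rows:
        # A GPU may appear only in billing (e.g. not yet schedulable).
        merged.setdefault(row["gpu_uuid"], {"gpu_uuid": row["gpu_uuid"]})
        merged[row["gpu_uuid"]]["hourly_cost_usd"] = row.get("hourly_cost_usd")
    return merged
```

Real pipelines would add more sources (DCGM, cloud APIs, labels) and conflict rules, but the join-on-UUID shape stays the same.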
Data Freshness Requirement
Data must be continuously updated, with a maximum delay of one hour.
Manual or ad-hoc updates are not acceptable.
Deliverables
• Implemented automated GPU inventory and usage collection
• Unified data model for GPUs, ownership, and usage
• Live dashboard with filtering and drill-down
• Documentation of architecture and integration options
• Clear list of known gaps, assumptions, and risks
Definition of Done (DoD)
This epic is complete when:
• GPU inventory and usage are collected automatically
• Data freshness is within one hour
• A single dashboard is live and accessible
• Managers can see which GPUs they own, how much they cost, and how they are used
• Usage patterns and inefficiencies can be clearly identified
• The architecture is reviewed and accepted as adoptable by other teams
• Known limitations are explicitly documented