  AI Platform Core Components / AIPCC-9828

Automated GPU Inventory & Live Visibility

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • Model Validation
    • Status: To Do
    • Parent: AIPCC-9826 GPU as a Non-Issue – Inventory, Research, and Incremental GPUaaS Implementation (AIPCC Only)
    • Progress: 0% To Do, 100% In Progress, 0% Done

      Context

      Across AIPCC, there is currently no single, authoritative, and continuously updated view of GPU resources.

       

      GPU information is fragmented across spreadsheets, cloud consoles, Kubernetes clusters, billing systems, and personal knowledge.

      As a result, leadership and teams cannot reliably answer basic questions such as:

      • Which GPUs exist?

      • Who owns them?

      • Where do they run?

      • How much are they actually used?

      • How much do they cost?

       

      Before advancing GPU sharing, quotas, or scheduling, we must first establish automated, trusted visibility into GPU ownership and usage.

       


      Objective

      Design and implement an automated solution that continuously collects, normalizes, and exposes GPU inventory, usage, and cost data across teams and environments, and presents it in a single, unified dashboard.

       

      This epic creates a near real-time source of truth that enables analysis, decision-making, and organizational alignment.

       


      Important Note

      This epic is not GPUaaS.

      We recognize that identifying and orchestrating GPU usage across all clusters, environments, and cloud accounts will take time.

      However, automated GPU inventory and visibility is a foundational and independent step that delivers immediate value.

       

      This work enables transparency and planning even without quotas, priorities, or preemption.

      It is a prerequisite for GPUaaS, but not dependent on GPUaaS.

       


      Scope

      In Scope

      • Automated discovery of GPUs across environments

      • Automated collection of usage and utilization metrics

      • Mapping GPUs to teams, ownership, environments, and clusters

      • Cost attribution at a GPU or pool level (best-effort)

      • Near real-time updates (maximum 1-hour freshness)

      • Centralized data model

      • A single, unified dashboard for analysis
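As a sketch of the first in-scope item, the snippet below extracts a GPU inventory from Kubernetes Node objects (as returned by `kubectl get nodes -o json`). The `nvidia.com/gpu` capacity key and the `nvidia.com/gpu.product` label are standard for NVIDIA device-plugin / GPU Feature Discovery setups; the `team` label key is an assumption and would vary per cluster:

```python
def extract_gpu_inventory(nodes):
    """Extract per-node GPU counts and ownership hints from
    Kubernetes Node objects (plain dicts, as in `kubectl ... -o json`)."""
    inventory = []
    for node in nodes:
        capacity = node.get("status", {}).get("capacity", {})
        gpu_count = int(capacity.get("nvidia.com/gpu", 0))
        if gpu_count == 0:
            continue  # skip non-GPU nodes
        labels = node.get("metadata", {}).get("labels", {})
        inventory.append({
            "node": node["metadata"]["name"],
            "gpus": gpu_count,
            # "team" is a hypothetical label key; real clusters differ
            "team": labels.get("team", "unknown"),
            "gpu_type": labels.get("nvidia.com/gpu.product", "unknown"),
        })
    return inventory
```

The same normalized record shape could be produced from cloud APIs or pushed data, which is what makes cross-source correlation possible.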

       

      Out of Scope

      • GPU scheduling

      • Quota enforcement

      • Priority handling

      • Preemption

      • Workload placement logic

       


      Architecture & Cross-Team Adoption Requirement

      The architecture must be acceptable to other teams across AIPCC.

      This is a hard requirement.

      The design must assume that:

      • Some teams may need to push data into the system

      • Other teams may require a lightweight controller / agent / integration running in their environment
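For teams that push data in, a minimal ingestion contract keeps the operational burden low. The schema below is purely illustrative (field names and the environment set are assumptions, not a decided API):

```python
# Hypothetical minimal schema for pushed GPU records
REQUIRED_FIELDS = {"gpu_id", "team", "environment", "cluster", "timestamp"}
KNOWN_ENVIRONMENTS = {"prod", "dev", "poc"}

def validate_push_payload(record):
    """Check one pushed GPU record against the minimal schema.
    Returns a list of problems; an empty list means the record is accepted."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    env = record.get("environment")
    if env is not None and env not in KNOWN_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems
```

Accepting records via a small, stable contract like this is what lets teams integrate without running any intrusive component in their environment.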

       

      Therefore, the architecture must:

      • Minimize operational burden

      • Avoid intrusive or risky components

      • Be clearly documented and easy to adopt

      • Respect team ownership and autonomy

       

      A solution that is technically sound but not adopted by teams is considered a failure.

       


      Key Capabilities

      The system must enable:

      • Identifying which GPUs exist and where they run

      • Mapping GPUs to owning teams and environments

      • Distinguishing used vs idle GPUs

      • Tracking usage over time (patterns, not just snapshots)

      • Associating GPU capacity with estimated cost

      • Highlighting underutilization and inefficiencies
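Distinguishing used vs idle GPUs can be as simple as averaging utilization samples over a window, e.g. GPU utilization percentages scraped from DCGM. The 5% idle threshold below is an assumption, not a decided policy:

```python
def classify_gpu(util_samples, idle_threshold=5.0):
    """Classify a GPU as 'used' or 'idle' from a window of utilization
    samples (percentages). Returns 'unknown' when no data was collected."""
    if not util_samples:
        return "unknown"
    avg = sum(util_samples) / len(util_samples)
    return "idle" if avg < idle_threshold else "used"
```

Applying this per GPU per window, rather than on single snapshots, is what surfaces the usage patterns and underutilization called out above.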

       


      Dashboard Requirements

      A single unified dashboard must exist and support:

      • Filtering by team

      • Filtering by environment (prod, dev, poc, etc.)

      • Filtering by cloud or cluster

      • Understanding who used which GPUs and when

      • Identifying usage patterns and trends over time

       

      The dashboard must support real decision-making and be usable by:

      • Engineering managers

      • Org leadership (for example Tom)

      • Platform and infra stakeholders
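The filter dimensions listed above map directly onto the unified record shape. A minimal sketch of that filtering logic, assuming records carry `team`, `environment`, and `cluster` fields:

```python
def filter_inventory(records, team=None, environment=None, cluster=None):
    """Apply the dashboard's filter dimensions to a list of GPU records.
    A filter left as None matches everything on that dimension."""
    def matches(r):
        return ((team is None or r.get("team") == team)
                and (environment is None or r.get("environment") == environment)
                and (cluster is None or r.get("cluster") == cluster))
    return [r for r in records if matches(r)]
```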

       


      Execution Approach

      This epic includes both design and implementation.

      Work includes:

      • Identifying and validating data sources (Kubernetes, cloud APIs, DCGM, billing, labels, etc.)

      • Defining a unified GPU data model

      • Building automated data ingestion pipelines

      • Normalizing and correlating data across sources

      • Implementing the initial dashboard
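To make "defining a unified GPU data model" concrete, here is one possible record shape as a Python dataclass. All field names, the 730-hour month, and the cost method are illustrative assumptions, not the decided model:

```python
from dataclasses import dataclass

@dataclass
class GPURecord:
    """One row of a unified GPU data model: identity, ownership,
    location, usage, and best-effort cost. Field names are illustrative."""
    gpu_id: str
    gpu_type: str           # e.g. "A100-80GB"
    team: str
    environment: str        # prod / dev / poc
    cluster: str
    cloud: str
    utilization_pct: float  # average over the last collection window
    hourly_cost_usd: float  # best-effort estimate, not a billing figure

    def monthly_cost_estimate(self):
        # Rough 730-hour month; best-effort, as per the scope above
        return round(self.hourly_cost_usd * 730, 2)
```

A single shape like this is what every ingestion pipeline, regardless of source, would normalize into before the dashboard consumes it.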

       


      Data Freshness Requirement

      Data must be continuously updated, with a maximum delay of one hour.

      Manual or ad-hoc updates are not acceptable.
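The one-hour freshness requirement can be enforced mechanically, e.g. by flagging stale records at read time. A minimal sketch, assuming each record carries a timezone-aware `last_updated` timestamp:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=1)

def is_fresh(last_updated, now=None):
    """True if a record's last update is within the one-hour freshness SLO."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= MAX_STALENESS
```

The dashboard could use the same check to visibly mark any source that has fallen behind, rather than silently showing stale data.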

       


      Deliverables

      • Implemented automated GPU inventory and usage collection

      • Unified data model for GPUs, ownership, and usage

      • Live dashboard with filtering and drill-down

      • Documentation of architecture and integration options

      • Clear list of known gaps, assumptions, and risks

       


      Definition of Done (DoD)

      This epic is complete when:

      • GPU inventory and usage are collected automatically

      • Data freshness is within one hour

      • A single dashboard is live and accessible

      • Managers can see which GPUs they own, how much they cost, and how they are used

      • Usage patterns and inefficiencies can be clearly identified

      • The architecture is reviewed and accepted as adoptable by other teams

      • Known limitations are explicitly documented

              rh-ee-abadli Aviran Badli