  1. AI Platform Core Components
  2. AIPCC-10229

Centralized GPU Data Collection Architecture

    • Story
    • Resolution: Unresolved
    • Critical
    • Model Validation

      Context

      Across AIPCC, GPU data exists in many clusters and environments, but there is no standardized, automated way to collect it into a single, trusted system.

      Before thinking about dashboards or user experience, we should define how GPU data flows:

      from nodes and clusters → into a centralized data store → in a reliable, near real-time way.

      This story focuses purely on data collection architecture, operational enablement, and adoption readiness.

       

      Objective

      Define an initial, scalable architecture for continuously collecting GPU inventory and usage data from many Kubernetes clusters into a centralized data store.

      This includes not only the technical design, but also clear documentation and "onboarding requirements" that enable other teams to participate.

       

      Key Design Questions

      This story must explicitly answer:

      • What data is collected at the node level vs cluster level?
      • What components must run on each node (if any)?
      • What components must run at the cluster level?
      • Do we pull data centrally, or do clusters push data?
      • What are the collection frequency and the end-to-end latency?
      • How is data normalized across clusters?
      • How is data validated and monitored?
      • How do we detect broken collectors or stale data?

       

      Scope

      In scope:

      • Identify data sources (DCGM, kubelet, Kubernetes API, cloud APIs, billing, labels)
      • Define node-level and cluster-level collectors
      • Define push vs pull ingestion models
      • Define centralized data storage (metrics / time-series / relational)
      • Define update frequency and freshness guarantees (≤ 1 hour)
      • Define health checks and data validation
      • Define operational requirements for teams
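
      To make the normalization and storage items above more concrete, the following is a minimal sketch of a shared record that collectors could map their data onto. The schema, the field names, and the raw input keys (util, mem_used_bytes, ts, and so on) are all assumptions made for illustration; the actual schema is one of the outputs this story is expected to define.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GpuSample:
    """One normalized GPU observation (hypothetical schema)."""
    cluster: str
    node: str
    gpu_uuid: str
    gpu_model: str
    utilization_pct: float   # 0.0 to 100.0
    memory_used_mib: int
    collected_at: datetime   # always timezone-aware UTC


def normalize(cluster: str, raw: dict) -> GpuSample:
    """Map a raw, collector-specific dict onto the shared schema.

    The raw keys used here are assumed, not taken from any real exporter.
    """
    util = float(raw["util"])
    if not 0.0 <= util <= 100.0:
        raise ValueError(f"utilization out of range: {util}")
    return GpuSample(
        cluster=cluster,
        node=raw["node"],
        gpu_uuid=raw["uuid"],
        # Model names tend to differ in case and padding between sources.
        gpu_model=raw["model"].strip().upper(),
        utilization_pct=util,
        # Store memory in one unit regardless of what the source reports.
        memory_used_mib=int(raw["mem_used_bytes"]) // (1024 * 1024),
        collected_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
    )


sample = normalize("cluster-a", {
    "node": "node-1", "uuid": "GPU-0000", "model": " a100 ",
    "util": "37.5", "mem_used_bytes": 2 * 1024**3, "ts": 0,
})
```

      Keeping units, timestamps, and naming fixed at the ingestion boundary means consumers never need to know which cluster or collector produced a sample.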

       

      Out of scope:

      • Dashboard UX or visualization
      • GPU scheduling, quotas, priorities, or preemption
      • GPUaaS behavior or policy enforcement

       

      Architecture Requirements

      The architecture must:

      • Support multiple clusters and environments
      • Be cloud-agnostic
      • Minimize operational and security burden on teams
      • Avoid intrusive or high-risk components
      • Support incremental onboarding
      • Clearly separate data collection from data consumption

       

      Team Enablement & Adoption

      This is a hard requirement.

      As part of this story, we must produce clear guidance for other teams, covering at least:

      • What must be installed on clusters
      • What must be installed on nodes (if applicable)
      • Required permissions and access
      • Required labels, annotations, or metadata
      • Expected data exposure format
      • Ownership and support model

       

      A technically correct architecture without clear adoption guidance is considered incomplete.

      Documentation Requirement

      This story must deliver a dedicated architecture document that includes:

      • Architecture overview and data flow
      • Component responsibilities
      • Deployment models
      • Security and network considerations
      • Onboarding steps for new clusters

      This document is the primary artifact used to engage other teams.

       

      Architecture Outputs – Mandatory Adoption Contract

      This story must produce not only a target architecture, but also a mandatory adoption contract derived from that architecture.

      The adoption contract explicitly defines what other teams must install, configure, or expose in their clusters or nodes in order to participate.

      This includes:

      • Mandatory node-level components (for example: DCGM, exporters, agents)
      • Mandatory cluster-level components (operators, CRDs, APIs)
      • Required labels, annotations, or metadata
      • Required data exposure mechanisms (push vs pull)
      • Required data freshness and health validation checks

      These requirements are non-negotiable technical prerequisites derived from the architecture.

      If a team cannot meet these requirements, their GPUs will not be visible in the system.
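
      One way to keep such a contract checkable is to express it as data and compare each cluster against it. The sketch below is illustrative only: the class names, the component names, and the label keys under gpu.example.com are placeholders, since the real requirements are what this story is meant to produce.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AdoptionContract:
    """Requirements a cluster must meet to participate (hypothetical shape)."""
    node_components: frozenset[str]
    cluster_components: frozenset[str]
    required_labels: frozenset[str]


@dataclass
class ClusterReport:
    """What a cluster reports as installed and configured (hypothetical shape)."""
    name: str
    node_components: set[str] = field(default_factory=set)
    cluster_components: set[str] = field(default_factory=set)
    labels: dict[str, str] = field(default_factory=dict)


def unmet_requirements(contract: AdoptionContract, report: ClusterReport) -> list[str]:
    """List every contract item the cluster does not satisfy, in a stable order."""
    missing: list[str] = []
    for item in sorted(contract.node_components - report.node_components):
        missing.append(f"node component: {item}")
    for item in sorted(contract.cluster_components - report.cluster_components):
        missing.append(f"cluster component: {item}")
    for item in sorted(contract.required_labels - set(report.labels)):
        missing.append(f"required label: {item}")
    return missing


# Placeholder names; not the actual contract.
contract = AdoptionContract(
    node_components=frozenset({"dcgm-exporter"}),
    cluster_components=frozenset({"gpu-operator"}),
    required_labels=frozenset({"gpu.example.com/owner", "gpu.example.com/environment"}),
)
report = ClusterReport(
    name="cluster-a",
    node_components={"dcgm-exporter"},
    labels={"gpu.example.com/environment": "prod"},
)
gaps = unmet_requirements(contract, report)
for gap in gaps:
    print(f"{report.name}: missing {gap}")
```

      An empty result would mean the cluster meets the contract; anything else gives the owning team a concrete list of what to fix before their GPUs become visible.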

      Data Integrity & Validation

      The design must include:

      • Freshness signals and heartbeats per cluster
      • Detection of missing, partial, or stale data
      • Sanity checks (GPU count mismatches, zero-usage anomalies)
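
      The checks listed above could look roughly like the following sketch. It assumes each cluster sends a heartbeat timestamp and a per-GPU utilization map, and it reuses the ≤ 1 hour freshness target from the Scope section; the function names and input shapes are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, taken from the "≤ 1 hour" freshness target in Scope.
MAX_AGE = timedelta(hours=1)


def stale_clusters(heartbeats: dict[str, datetime], now: datetime) -> list[str]:
    """Return the clusters whose last heartbeat is older than MAX_AGE."""
    return sorted(name for name, ts in heartbeats.items() if now - ts > MAX_AGE)


def sanity_issues(expected_gpus: int, utilization_by_gpu: dict[str, float]) -> list[str]:
    """Flag GPU count mismatches and all-zero utilization for one cluster."""
    issues: list[str] = []
    reported = len(utilization_by_gpu)
    if reported != expected_gpus:
        issues.append(
            f"gpu count mismatch: expected {expected_gpus}, reported {reported}"
        )
    # All-zero usage can be legitimate (an idle cluster), so this is a
    # signal to investigate rather than a hard failure.
    if reported > 0 and all(u == 0.0 for u in utilization_by_gpu.values()):
        issues.append("zero-usage anomaly: every GPU reports 0% utilization")
    return issues


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "cluster-a": now - timedelta(minutes=5),
    "cluster-b": now - timedelta(hours=3),
}
stale = stale_clusters(heartbeats, now)
issues = sanity_issues(4, {"GPU-0": 0.0, "GPU-1": 0.0})
```

      The expected GPU count would have to come from an independent source (for example, inventory or the Kubernetes API) for the mismatch check to be meaningful.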

       

      Deliverables

      • Architecture diagram focused on data flow
      • Written architecture and onboarding document
      • Defined installation requirements for clusters and nodes
      • Defined ingestion and validation mechanisms
      • List of known gaps, assumptions, and risks

       

      Definition of Done (DoD)

      This story is complete when:

      • A clear, end-to-end data collection architecture exists
      • Required components per node and per cluster are documented
      • Data ingestion and validation paths are defined
      • A written architecture and onboarding document exists
      • The design has been reviewed by other team members

       

      Collaboration & Ongoing Alignment 

      This story will be carried out in close collaboration with our core working group.

       

      Throughout the execution of this story:

      • Intermediate designs and decisions will be shared with the core team on an ongoing basis
      • Key milestones and architectural steps will be presented as they evolve, not only at the end
      • Feedback will be actively incorporated to reduce blind spots and course-correct early

       

      The goal is to ensure:

      • Shared understanding of the architecture as it forms
      • Early validation of assumptions
      • Strong alignment before engaging broader teams across the organization

      This collaboration is considered part of the execution, not an optional review step.

       

       

              rh-ee-scondon Sean Condon
              rh-ee-abadli Aviran Badli