OCPSTRAT-2119

DCGM Monitoring Dashboard for NVIDIA GPU Workloads in OCP


    • Product / Portfolio Work
    • OCPSTRAT-1692 AI Workloads for OpenShift

      Feature Overview:

      Enable seamless GPU observability in OpenShift by integrating NVIDIA DCGM (Data Center GPU Manager) metrics into the OpenShift Monitoring stack. This feature will provide a one-click solution to collect, export, and visualize GPU telemetry using prebuilt dashboards.
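
As a rough illustration of the "collect" step, the sketch below parses GPU samples from Prometheus text exposition format, which is the format a DCGM exporter endpoint serves. The sample payload, its values, and the `ExampleGPU` model name are made up for illustration, and `parse_metrics` is a hypothetical helper, not part of any OpenShift or NVIDIA API. The parser is deliberately simplified: it skips lines that carry a trailing timestamp and does not handle escaped quotes in label values.

```python
import re

# Hypothetical sample payload; the values and model name are made up.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="ExampleGPU"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="ExampleGPU"} 64
'''

# metric_name{label="value",...} value
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith('#'):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # unsupported line shape (for example, with a timestamp)
        labels = dict(LABEL.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


if __name__ == '__main__':
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In a real cluster the monitoring stack would scrape and store these series itself; the parser only shows what the exported telemetry looks like.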


      Use Case:

      OpenShift administrators running AI/ML workloads on GPU nodes often need deep insight into GPU utilization, temperature, memory usage, and clock throttling events. Manually instrumenting and collecting these metrics is complex and error-prone. This feature simplifies GPU observability with minimal setup effort.

      Key Metrics Collected:

      • GPU Utilization (%)
      • Streaming Multiprocessor (SM) Utilization (%)
      • GPU Temperature (°C)
      • Framebuffer Memory Utilization (%)
      • Clock Throttle Events (avg by model)
      • SM Clock Speed (MHz)
      • Power Consumption (W)
      • I/O and memory throughput
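
One possible way to turn several of the metrics listed above into dashboard queries is sketched below. The mapping to DCGM exporter series names is an assumption (the issue does not name specific fields), and which series are actually exposed depends on the exporter's metrics configuration. The dictionary keys and the `avg_by` helper are hypothetical names chosen for this example. Clock throttle events and I/O throughput are left out because the issue does not say which fields back them.

```python
# Assumed mapping from listed metrics to PromQL expressions over
# DCGM exporter series. Keys are hypothetical identifiers for this sketch.
QUERIES = {
    "gpu_utilization_pct": "DCGM_FI_DEV_GPU_UTIL",
    # Assumption: SM activity is reported as a 0..1 ratio, scaled here to percent.
    "sm_utilization_pct": "100 * DCGM_FI_PROF_SM_ACTIVE",
    "gpu_temperature_c": "DCGM_FI_DEV_GPU_TEMP",
    # Used framebuffer as a share of used + free framebuffer.
    "framebuffer_utilization_pct": (
        "100 * DCGM_FI_DEV_FB_USED"
        " / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)"
    ),
    "sm_clock_mhz": "DCGM_FI_DEV_SM_CLOCK",
    "power_usage_w": "DCGM_FI_DEV_POWER_USAGE",
}


def avg_by(expr, label="modelName"):
    """Wrap a PromQL expression in an average grouped by one label."""
    return f"avg by ({label}) ({expr})"


if __name__ == "__main__":
    for key, expr in QUERIES.items():
        print(f"{key}: {avg_by(expr)}")
```

Grouping by `modelName` mirrors the "avg by model" wording in the list; grouping by a node or GPU index label instead would give per-device panels.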

              gausingh@redhat.com Gaurav Singh