Feature
Resolution: Unresolved
Feature Overview:
Enable seamless GPU observability in OpenShift by integrating NVIDIA DCGM (Data Center GPU Manager) metrics into the OpenShift Monitoring stack. This feature will provide a one-click solution to collect, export, and visualize GPU telemetry using prebuilt dashboards.
Use Case:
OpenShift administrators running AI/ML workloads on GPU nodes often need deep insights into GPU utilization, temperature, memory, and clock events. Manually instrumenting and collecting these metrics is complex and error-prone. This feature simplifies GPU observability with minimal setup effort.
Key Metrics Collected:
- GPU Utilization (%)
- SM Utilization (%)
- GPU Temperature (°C)
- Framebuffer Memory Utilization (%)
- Clock Throttle Events (avg by model)
- Streaming Multiprocessor (SM) Clock Speed
- Power Consumption (W)
- I/O and Memory Throughput
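As a rough illustration of how the metrics above could be turned into dashboard queries, the following sketch builds PromQL expressions from DCGM field names. The mapping is an assumption, not part of this request: the field names are standard DCGM identifiers, but which ones the exporter actually publishes depends on its configuration, and the `modelName` label and the helper names (`avg_by`, `fb_utilization_percent`, `build_queries`) are hypothetical choices made for this example.

```python
# Assumed mapping from the listed metrics to PromQL expressions over DCGM fields.
# Whether each field is exported depends on the DCGM exporter configuration.
DCGM_EXPRESSIONS = {
    "GPU Utilization (%)": "DCGM_FI_DEV_GPU_UTIL",
    # DCGM_FI_PROF_SM_ACTIVE is a 0..1 ratio, so scale it to a percentage.
    "SM Utilization (%)": "100 * DCGM_FI_PROF_SM_ACTIVE",
    "GPU Temperature (°C)": "DCGM_FI_DEV_GPU_TEMP",
    "SM Clock Speed": "DCGM_FI_DEV_SM_CLOCK",
    "Power Consumption (W)": "DCGM_FI_DEV_POWER_USAGE",
}


def avg_by(expression: str, label: str = "modelName") -> str:
    """Wrap an expression in a PromQL average grouped by one label.

    The default label name is an assumption about the exporter's labels.
    """
    return f"avg by ({label}) ({expression})"


def fb_utilization_percent() -> str:
    """Derive framebuffer utilization from the used and free framebuffer fields."""
    used = "DCGM_FI_DEV_FB_USED"
    free = "DCGM_FI_DEV_FB_FREE"
    return f"100 * {used} / ({used} + {free})"


def build_queries() -> dict:
    """Return one PromQL query string per metric name."""
    queries = {name: avg_by(expr) for name, expr in DCGM_EXPRESSIONS.items()}
    queries["Framebuffer Memory Utilization (%)"] = avg_by(fb_utilization_percent())
    return queries


if __name__ == "__main__":
    for name, query in build_queries().items():
        print(f"{name}: {query}")
```

Grouping by GPU model keeps panels readable on clusters with mixed hardware; a per-node or per-GPU view would only need a different `label` argument.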
Relates to: RFE-3175 "Access to low level HW and GPU metrics" (Closed)