- Feature Request
- Resolution: Done
- Normal
- None
- 4.13
- False
- None
- False
- Not Selected
- x86_64
- Red Hat OpenShift AI
1. Proposed title of this feature request
GPU Metrics in User Workload Monitoring
2. What is the nature and description of the request?
Customers run AI / ML workloads that utilise GPUs heavily. GPUs expose metrics such as utilisation (see for example https://github.com/NVIDIA/dcgm-exporter, and https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html for a list of fields). Administrators can already view these metrics using Prometheus, custom dashboards or the GPU Monitoring Dashboard: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-monitoring-dashboard.html
This RFE requests that GPU metrics that already carry the "exported_namespace" label be visible in the user workload monitoring overview for the namespace named in that label.
Today customers work around this with custom dashboards outside of OpenShift Container Platform.
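As an illustration of the kind of per-namespace view being requested, the following is a minimal sketch in Python that builds a PromQL query restricted to one namespace through the "exported_namespace" label. It assumes the dcgm-exporter metric name DCGM_FI_DEV_GPU_UTIL and an "exported_pod" label for grouping. The helper names gpu_util_query and query_url and the base URL are hypothetical, and the sketch only constructs the request against the standard Prometheus /api/v1/query endpoint; it does not send it or handle authentication.

```python
from urllib.parse import urlencode


def gpu_util_query(namespace: str, metric: str = "DCGM_FI_DEV_GPU_UTIL") -> str:
    """Build a PromQL query for GPU utilisation in a single namespace.

    Assumption: the metric carries "exported_namespace" and "exported_pod"
    labels, as dcgm-exporter metrics typically do once scraped.
    """
    # Escape backslashes and double quotes so the value is a valid PromQL string.
    escaped = namespace.replace("\\", "\\\\").replace('"', '\\"')
    return f'avg by (exported_pod) ({metric}{{exported_namespace="{escaped}"}})'


def query_url(base_url: str, namespace: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API.

    base_url is a hypothetical Prometheus-compatible endpoint.
    """
    params = urlencode({"query": gpu_util_query(namespace)})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


if __name__ == "__main__":
    # Hypothetical namespace and endpoint, for illustration only.
    print(gpu_util_query("team-a"))
    print(query_url("https://prometheus.example", "team-a"))
```

With this RFE implemented, a user with access to a namespace would see the result of such a query in the user workload monitoring overview rather than in a separately maintained dashboard.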
3. Why does the customer need this?
Customers run AI / ML workloads on OpenShift Container Platform and would like to provide GPU metrics to their end users from within OpenShift Container Platform. Giving end users visibility into GPU usage in their own namespaces allows customers to better utilise their GPU resources.
4. List any affected packages or components.
OpenShift Monitoring
User Workload Monitoring