Type: Feature Request
Resolution: Won't Do
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.13
Component/s: ai-ml-workloads, AI/ML Workloads, Monitoring
Labels:

Target Version:
None
Activity Type:
Product / Portfolio Work
Status Summary:
None
Blocked:
False
Blocked Reason:

None
Products:

Red Hat OpenShift AI
Hierarchy Progress Bar:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Impact Score:
PX Impact Range:
None
PX Priority Data:
None
PX Technical Impact:
None
PX Technical Impact Notes:
None
PX Scheduling Request:
None

1. Proposed title of this feature request

GPU Metrics in User Workload Monitoring

2. What is the nature and description of the request?

Customers run AI / ML workloads that use GPUs heavily. GPUs expose metrics such as utilisation (see for example https://github.com/NVIDIA/dcgm-exporter, and https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html for a list of fields). Administrators can view these metrics using Prometheus, custom dashboards or the GPU Monitoring Dashboard: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-monitoring-dashboard.html

This RFE requests that GPU metrics that already carry the "exported_namespace" label be made visible in the user workload monitoring overview for the corresponding namespace.

Today customers do this with custom dashboards outside of OpenShift Container Platform.
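
For illustration, the per-namespace selection that such a custom dashboard performs could look roughly like the sketch below. This is a minimal example under stated assumptions, not a description of any existing implementation: the helper names `gpu_util_query` and `filter_by_namespace` are hypothetical, `DCGM_FI_DEV_GPU_UTIL` is used as an example metric from dcgm-exporter, and the presence of an "exported_pod" label on the samples is assumed.

```python
# Hypothetical helpers, for illustration only.

def gpu_util_query(namespace: str, metric: str = "DCGM_FI_DEV_GPU_UTIL") -> str:
    """Build a PromQL expression for GPU utilisation in one namespace.

    Assumes the metric carries "exported_namespace" and "exported_pod" labels.
    """
    # Escape backslashes and double quotes so the label value stays valid PromQL.
    escaped = namespace.replace("\\", "\\\\").replace('"', '\\"')
    return f'avg by (exported_pod) ({metric}{{exported_namespace="{escaped}"}})'


def filter_by_namespace(samples: list[dict], namespace: str) -> list[dict]:
    """Keep only samples whose "exported_namespace" label matches."""
    return [
        s for s in samples
        if s.get("labels", {}).get("exported_namespace") == namespace
    ]


# Made-up sample data; the namespaces, pods and values are placeholders.
samples = [
    {"labels": {"exported_namespace": "team-a", "exported_pod": "train-0"}, "value": 87.0},
    {"labels": {"exported_namespace": "team-b", "exported_pod": "infer-0"}, "value": 12.0},
]

team_a_samples = filter_by_namespace(samples, "team-a")
team_a_query = gpu_util_query("team-a")
```

The request is that this kind of namespace scoping happens inside user workload monitoring itself, so that end users do not need a separate dashboard.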

3. Why does the customer need this?

Customers use OpenShift Container Platform to run AI / ML workloads and would like to make GPU metrics available to their end users within OpenShift Container Platform. Giving end users this visibility would help customers better utilise their GPU resources.

4. List any affected packages or components.

OpenShift Monitoring
User Workload Monitoring

Assignee: Roger Florén

Reporter: Simon Krenger

Need Info From: None

Votes: 0

Watchers: 3

Created: 2023/09/28 10:32 AM

Updated: 2025/10/27 3:12 PM

Resolved: 2024/04/22 12:03 PM

Target start: None

Target end: None
