OpenShift Request For Enhancement / RFE-4728

GPU Metrics in User Workload Monitoring

    • Feature Request
    • Resolution: Done
    • Normal
    • None
    • 4.13
    • Monitoring
    • False
    • None
    • False
    • Not Selected
    • x86_64
    • Red Hat OpenShift AI

      1. Proposed title of this feature request

      GPU Metrics in User Workload Monitoring

      2. What is the nature and description of the request?

      Customers run AI / ML workloads that make heavy use of GPUs. GPUs expose metrics such as utilisation (see for example https://github.com/NVIDIA/dcgm-exporter, and https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html for a list of fields). Administrators can view these metrics using Prometheus, custom dashboards or the GPU Monitoring Dashboard: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-monitoring-dashboard.html

      This RFE requests that GPU metrics that already carry the "exported_namespace" label be visible in the user workload monitoring overview for the namespace named in that label.
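
      To illustrate the kind of per-namespace view requested above, here is a minimal sketch of a helper that builds a PromQL expression selecting one GPU metric by its "exported_namespace" label. The function name gpu_metric_query is hypothetical, and the default metric DCGM_FI_DEV_GPU_UTIL is an assumption (a utilisation gauge exposed by dcgm-exporter); the ticket itself does not name a specific metric or query.

```python
def gpu_metric_query(namespace: str, metric: str = "DCGM_FI_DEV_GPU_UTIL") -> str:
    """Build a PromQL expression averaging a GPU metric over one namespace.

    Assumption: the metric carries an "exported_namespace" label, as
    described in this RFE. The function name and default metric are
    illustrative only.
    """
    # Escape backslashes and double quotes so the namespace stays a
    # valid PromQL string literal.
    escaped = namespace.replace("\\", "\\\\").replace('"', '\\"')
    return f'avg({metric}{{exported_namespace="{escaped}"}})'


if __name__ == "__main__":
    # "ml-team" is a made-up namespace used only for illustration.
    print(gpu_metric_query("ml-team"))
```

      An expression like this is what an end user would need to be able to evaluate from within their own namespace, rather than through a custom dashboard outside of OpenShift Container Platform.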

      Today customers do this with custom dashboards outside of OpenShift Container Platform.

      3. Why does the customer need this?

      Customers are using OpenShift Container Platform to run AI / ML workloads and would like to make GPU metrics available to their end users within OpenShift Container Platform. This allows customers to better utilise their GPU resources.

      4. List any affected packages or components.

      OpenShift Monitoring
      User Workload Monitoring

              rh-ee-rfloren Roger Florén
              rhn-support-skrenger Simon Krenger