OpenShift Request For Enhancement / RFE-4728

GPU Metrics in User Workload Monitoring

Details

    • Feature Request
    • Resolution: Unresolved
    • Normal
    • None
    • 4.13
    • Monitoring
    • False
    • None
    • False
    • Not Selected
    • x86_64
    • Red Hat OpenShift AI

    Description

      1. Proposed title of this feature request

      GPU Metrics in User Workload Monitoring

      2. What is the nature and description of the request?

      Customers run AI / ML workloads that make heavy use of GPUs. GPUs expose metrics such as utilisation; see https://github.com/NVIDIA/dcgm-exporter for the exporter and https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html for a list of fields. Administrators can view these metrics today using Prometheus, custom dashboards or the GPU Monitoring Dashboard: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-monitoring-dashboard.html

      This RFE requests that GPU metrics which already carry the "exported_namespace" label be visible in the user workload monitoring overview for the corresponding namespace.
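
      To illustrate the per-namespace scoping requested here, the following is a minimal sketch, not an existing OpenShift or NVIDIA API. It assumes the dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL as the example metric; the helper names gpu_query and samples_for_namespace, the sample data layout and the namespace "ml-team" are hypothetical.

```python
def gpu_query(namespace: str, metric: str = "DCGM_FI_DEV_GPU_UTIL") -> str:
    """Build a PromQL selector limited to one workload namespace.

    Assumption: the metric carries the "exported_namespace" label,
    as described in this RFE. The function name is hypothetical.
    """
    # Escape backslashes and double quotes so the label value stays valid PromQL.
    escaped = namespace.replace("\\", "\\\\").replace('"', '\\"')
    return f'{metric}{{exported_namespace="{escaped}"}}'


def samples_for_namespace(samples, namespace):
    """Keep only samples whose exported_namespace label matches.

    Assumption: each sample is a dict with a "labels" dict; this layout
    is illustrative and not a Prometheus client type.
    """
    return [
        s for s in samples
        if s.get("labels", {}).get("exported_namespace") == namespace
    ]


if __name__ == "__main__":
    print(gpu_query("ml-team"))
```

      A user of the hypothetical "ml-team" namespace would then only see series selected by gpu_query("ml-team"), which is the scoping this RFE asks the user workload monitoring overview to apply.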

      Today customers work around this gap with custom dashboards outside of OpenShift Container Platform.

      3. Why does the customer need this?

      Customers run AI / ML workloads on OpenShift Container Platform and would like to provide GPU metrics to their end users within OpenShift Container Platform. Giving end users this visibility allows customers to better utilise their GPU resources.

      4. List any affected packages or components.

      OpenShift Monitoring
      User Workload Monitoring


          People

            rh-ee-rfloren Roger Florén
            rhn-support-skrenger Simon Krenger