  1. Red Hat Advanced Cluster Management
  2. ACM-7576

RFE GPU Metrics in RHACM Observability

    • Observability

      1. Proposed title of this feature request

      GPU Metrics in RHACM Observability

      2. What is the nature and description of the request?

      Customers run AI / ML workloads on OpenShift Container Platform. To accelerate these workloads, they equip OpenShift Container Platform worker nodes with GPUs.

      With the GPU Operator, customers automatically get access to GPU metrics, for example via the NVIDIA DCGM Exporter (https://github.com/NVIDIA/dcgm-exporter). The available metric fields are listed at https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html

      This request asks for these metrics to be forwarded to RHACM as well, so that customers can view and manage their GPU resources centrally. Once the metrics are available in RHACM, it could also provide a dashboard for them.
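
As an illustration of what forwarding these metrics could look like today, here is a minimal sketch that generates a custom-metrics allowlist ConfigMap for RHACM Observability. It rests on assumptions not stated in this request: that the ConfigMap is named `observability-metrics-custom-allowlist`, lives in the `open-cluster-management-observability` namespace, and carries a `metrics_list.yaml` key with a `names:` list. The four DCGM metric names are only a sample of what the exporter may expose, and the helper functions are hypothetical.

```python
import json

# Assumption: namespace and ConfigMap name used by RHACM Observability
# for user-defined metric allowlists. Verify against your RHACM version.
NAMESPACE = "open-cluster-management-observability"
CONFIGMAP_NAME = "observability-metrics-custom-allowlist"

# Illustrative sample of DCGM Exporter metric names (not an exhaustive list).
GPU_METRICS = [
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature
    "DCGM_FI_DEV_POWER_USAGE",  # power draw
]


def build_allowlist_yaml(metric_names):
    """Render the assumed `metrics_list.yaml` payload: a `names:` list."""
    lines = ["names:"] + [f"  - {name}" for name in metric_names]
    return "\n".join(lines) + "\n"


def build_configmap(metric_names):
    """Return the ConfigMap manifest as a plain dict (JSON-serializable)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": CONFIGMAP_NAME, "namespace": NAMESPACE},
        "data": {"metrics_list.yaml": build_allowlist_yaml(metric_names)},
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be reviewed before applying.
    print(json.dumps(build_configmap(GPU_METRICS), indent=2))
```

The manifest is emitted as JSON rather than YAML so the sketch needs only the standard library. The output could be reviewed and then applied on the hub cluster, provided the assumed names match the installed RHACM release.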

      3. Why does the customer need this? (List the business requirements here)

      Customers are using OpenShift Container Platform for AI / ML workloads. To make better use of GPU resources and to understand overall GPU usage, these metrics should be available centrally in RHACM.

      4. List any affected packages or components.

      RHACM Observability

              rhn-support-cstark Christian Stark
              rhn-support-skrenger Simon Krenger
              Christian Stark Christian Stark