OpenShift Console / CONSOLE-3406

NVIDIA GPU administration dashboard not showing metrics

      Case number: 03410765

      We installed the NVIDIA GPU administration dashboard by following the documentation at:

      https://docs.openshift.com/container-platform/4.11/monitoring/nvidia-gpu-admin-dashboard.html

      However, the dashboard does not show all of the data. The GPU count is shown as 0, and the other panels show "No datapoints found". Please see the attached picture. We see this behaviour on both of our clusters, even though the Graphics Cards count in the cluster inventory increases as expected.

      The metrics queries return nothing:
      count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)
      0
      count(count by (UUID,GPU_I_ID) (DCGM_FI_DEV_MEM_COPY_UTIL))
      No datapoints found
      sum(max by (UUID) (DCGM_FI_DEV_POWER_USAGE))
      No datapoints found
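The three queries above can also be checked outside the dashboard, directly against the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not the method used in this case: the base URL is a placeholder, authentication is left out, and `query_url` and `summarize` are hypothetical helper names. It shows how an empty result vector corresponds to the "No datapoints found" message, while the `or vector(0)` fallback yields a single sample with value 0.

```python
import json
from urllib.parse import urlencode

# The three dashboard queries quoted in this issue.
QUERIES = [
    'count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)',
    'count(count by (UUID,GPU_I_ID) (DCGM_FI_DEV_MEM_COPY_UTIL))',
    'sum(max by (UUID) (DCGM_FI_DEV_POWER_USAGE))',
]


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API.

    base_url is an assumption: it stands for whatever Prometheus-compatible
    endpoint is reachable in the cluster.
    """
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def summarize(response_body):
    """Turn the JSON body of an instant query into a one-line summary.

    Only handles vector results, which is what count(...) and sum(...) return.
    """
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        return "error: " + str(doc.get("error", "unknown"))
    result = doc["data"]["result"]
    if not result:
        # An empty vector is what the console renders as "No datapoints found".
        return "No datapoints found"
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return ", ".join(sample["value"][1] for sample in result)
```

If all three queries come back empty (or 0 for the first one), the DCGM metrics are most likely not being scraped at all, which points at the monitoring setup rather than at the dashboard itself.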
       

      The console and the GPU operator both report that everything is fine:

      2023-01-11T12:48:13.007699309Z device-plugin workload validation is successful
      2023-01-11T12:48:22.425478014Z time="2023-01-11T12:48:22Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is ready"
      2023-01-11T12:48:22.425478014Z time="2023-01-11T12:48:22Z" level=info msg="metrics: StatusFile: 'cuda-ready' is ready"
      2023-01-11T12:48:22.558594743Z time="2023-01-11T12:48:22Z" level=info msg="metrics: DevicePlugin validation: found 1 GPUs exposed by the DevicePlugin"

      The cluster monitoring label is set:

      openshift.io/cluster-monitoring: "true"
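The label can be verified from the namespace object itself. The sketch below assumes the namespace JSON has been exported beforehand (for example with `oc get namespace <name> -o json`; the namespace name is not given in this issue) and uses a hypothetical helper, `has_monitoring_label`, to check that the label is present with the exact string value "true".

```python
import json

MONITORING_LABEL = "openshift.io/cluster-monitoring"


def has_monitoring_label(namespace_json):
    """Return True if the namespace carries the cluster-monitoring label.

    namespace_json is the JSON text of a Namespace object. The value must be
    the string "true"; a missing label or any other value counts as not set.
    """
    ns = json.loads(namespace_json)
    labels = ns.get("metadata", {}).get("labels") or {}
    return labels.get(MONITORING_LABEL) == "true"
```

A result of `False` here would mean that cluster monitoring is not expected to pick up metrics from that namespace, regardless of what the operator logs report.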

      Possibly related to https://github.com/NVIDIA/gpu-operator/issues/469

              Assignee: Unassigned
              Reporter: Peter Ducai (rhn-support-pducai)