OpenShift Console / CONSOLE-3406

NVIDIA GPU administration dashboard not showing metrics



      Case number: 03410765


      We installed the NVIDIA GPU administration dashboard by following the linked instructions.


      However, the dashboard does not show all of the data. The GPU count shows as 0, and the other panels show "No datapoints found". Please see the attached picture. We see this behaviour on both of our clusters, although the Graphics Cards count in the cluster inventory increases as expected.


      The metrics queries return nothing:
      count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)
      count(count by (UUID,GPU_I_ID) (DCGM_FI_DEV_MEM_COPY_UTIL))
      No datapoints found
      sum(max by (UUID) (DCGM_FI_DEV_POWER_USAGE))
      No datapoints found
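To reproduce the check outside the dashboard, the three queries above can be run against the Prometheus HTTP API and inspected for empty results. The following is a minimal sketch, not the dashboard's own code: the helper names `query_url` and `sample_values` and the base URL are hypothetical, and it assumes the standard instant-query endpoint `/api/v1/query` returning a `vector` result. Note that the first query ends in `or vector(0)`, so it falls back to 0 when the left-hand side is empty, which would be consistent with the GPU count showing 0 while the other panels show no datapoints.

```python
import json
import urllib.parse

# Queries copied verbatim from the report.
QUERIES = [
    'count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)',
    'count(count by (UUID,GPU_I_ID) (DCGM_FI_DEV_MEM_COPY_UTIL))',
    'sum(max by (UUID) (DCGM_FI_DEV_POWER_USAGE))',
]


def query_url(base_url, promql):
    """Build an instant-query URL (hypothetical helper; base_url is an assumption)."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def sample_values(response_text):
    """Return sample values from a 'vector' instant-query response.

    An empty list corresponds to "No datapoints found" in the dashboard.
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %s" % body.get("error", "unknown error"))
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return [float(sample["value"][1]) for sample in body["data"]["result"]]


if __name__ == "__main__":
    # Placeholder host; print the URLs so they can be fetched with any HTTP client.
    for q in QUERIES:
        print(query_url("https://prometheus.example.invalid", q))
```

Fetching the URLs is left out on purpose, because the endpoint address and authentication (for example a bearer token) depend on the cluster. Passing each response body to `sample_values` shows whether a query returned no samples at all or only the `vector(0)` fallback.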

      The console and the GPU operator both report that everything is fine:

      2023-01-11T12:48:13.007699309Z device-plugin workload validation is successful
      2023-01-11T12:48:22.425478014Z time="2023-01-11T12:48:22Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is ready"
      2023-01-11T12:48:22.425478014Z time="2023-01-11T12:48:22Z" level=info msg="metrics: StatusFile: 'cuda-ready' is ready"
      2023-01-11T12:48:22.558594743Z time="2023-01-11T12:48:22Z" level=info msg="metrics: DevicePlugin validation: found 1 GPUs exposed by the DevicePlugin"

      The monitoring label is set:

      openshift.io/cluster-monitoring: "true"

      Possibly related to https://github.com/NVIDIA/gpu-operator/issues/469

        1. inspect.local1.tar.gz
          1.01 MB
          Peter Ducai
        2. inspect.local2.tar.gz
          290 kB
          Peter Ducai

            Unassigned Unassigned
            rhn-support-pducai Peter Ducai
