-
Bug
-
Resolution: Done
-
Major
Case number: 03410765
We installed the NVIDIA GPU administration dashboard by following the documentation at this link:
https://docs.openshift.com/container-platform/4.11/monitoring/nvidia-gpu-admin-dashboard.html
However, the dashboard does not show all of the data. The GPU count shows 0, and the other panels show "No datapoints found". Please check the attached picture. We see this behaviour on both of our clusters, although the graphics card count in the cluster inventory increases as expected.
The metrics queries return nothing:
count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)
0
count(count by (UUID,GPU_I_ID) (DCGM_FI_DEV_MEM_COPY_UTIL))
No datapoints found
sum(max by (UUID) (DCGM_FI_DEV_POWER_USAGE))
No datapoints found
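For reference, the three queries above can be re-run outside the dashboard through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not the dashboard's own implementation: the base URL and bearer token passed to `run_query` are assumptions that depend on how the cluster's query endpoint is exposed, and the helper names are hypothetical. Note that the first query ends in `or vector(0)`, so it returns 0 whenever no matching series exist. The 0 is therefore consistent with the "No datapoints found" results of the other two queries: no DCGM series are present at all.

```python
import json
import urllib.parse
import urllib.request

# The three PromQL queries quoted in this ticket.
QUERIES = [
    'count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)',
    'count(count by (UUID,GPU_I_ID) (DCGM_FI_DEV_MEM_COPY_UTIL))',
    'sum(max by (UUID) (DCGM_FI_DEV_POWER_USAGE))',
]


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def summarize(response):
    """Turn a decoded instant-vector response into a short, dashboard-like string."""
    if response.get("status") != "success":
        return "query failed: " + str(response.get("error", "unknown error"))
    result = response["data"]["result"]
    if not result:
        # An empty vector is what the console reports as "No datapoints found".
        return "No datapoints found"
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return ", ".join(sample["value"][1] for sample in result)


def run_query(base_url, promql, token):
    """Send one query. base_url and token are assumptions about the cluster setup."""
    request = urllib.request.Request(
        build_query_url(base_url, promql),
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With a reachable endpoint, `for q in QUERIES: print(summarize(run_query(base_url, q, token)))` would print one line per query, which makes it easy to compare both clusters.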
The console and the GPU Operator logs show that everything is fine:
2023-01-11T12:48:13.007699309Z device-plugin workload validation is successful
2023-01-11T12:48:22.425478014Z time="2023-01-11T12:48:22Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is ready"
2023-01-11T12:48:22.425478014Z time="2023-01-11T12:48:22Z" level=info msg="metrics: StatusFile: 'cuda-ready' is ready"
2023-01-11T12:48:22.558594743Z time="2023-01-11T12:48:22Z" level=info msg="metrics: DevicePlugin validation: found 1 GPUs exposed by the DevicePlugin"
The monitoring label is set:
openshift.io/cluster-monitoring: "true"