Bug
Resolution: Unresolved
Normal
4.14.z
Quality / Stability / Reliability
Low
Description of problem:
Running on EC2 with large hardware spec worker nodes, such as G6.48xlarge with a 100 Gbps NIC, the GPU operator will sometimes fail on the node, logging "Unable to connect to server: dial tcp <some redacted gateway ip>: i/o timeout" followed on the next line by "Unable to get value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' label". The visible failure is a CrashLoopBackOff of the NVIDIA device plugin validator pod, which complains that "allocate failed due to required number of devices unavailable for nvidia.com/gpu. Requested 1, Available 0 which is unexpected". There is also an associated nvidia-operator-validator pod on the same node that fails to initialize because of the failed plugin validation container, which I assume is related to the previous error. Rebooting the node seems to resolve the issue. Sometimes it happens repeatedly, and other times it won't happen for days, but it is a problem when using an autoscaler. It never seems to happen on nodes with lower memory and network bandwidth.
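For triage on an affected node, a minimal diagnostic sketch (not from the report; it assumes cluster access via a kubeconfig and uses a placeholder node name) that checks the two things the failing containers complain about: the gpu-feature-discovery label and the advertised nvidia.com/gpu allocatable count.

```python
# Diagnostic sketch only; the node name and kubeconfig source are assumptions.
from kubernetes import client, config

NODE_NAME = "ip-10-0-0-1.ec2.internal"  # hypothetical node name

config.load_kube_config()  # or config.load_incluster_config() when run in a pod
v1 = client.CoreV1Api()

node = v1.read_node(NODE_NAME)
labels = node.metadata.labels or {}
allocatable = node.status.allocatable or {}

# The operator validator logs "Unable to get value of the
# 'nvidia.com/gpu.deploy.gpu-feature-discovery' label" when this is missing.
gfd_label = labels.get("nvidia.com/gpu.deploy.gpu-feature-discovery")
print(f"gpu-feature-discovery label: {gfd_label!r}")

# The plugin validator fails with "Requested 1, Available 0" when the device
# plugin never advertises any nvidia.com/gpu resources on the node.
gpu_count = allocatable.get("nvidia.com/gpu", "0")
print(f"allocatable nvidia.com/gpu: {gpu_count}")
```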
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. Set up a cluster on EC2.
2. Install the NVIDIA GPU operator and OpenShift AI.
3. Scale up the cluster using the GPU node type G6.48xlarge with a 100 Gbps NIC (see the sketch after this list).
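For step 3, a hypothetical scale-up sketch, assuming the GPU workers come from an OpenShift MachineSet in openshift-machine-api (the MachineSet name and replica count are placeholders; scaling through the cluster autoscaler or the console is equivalent).

```python
# Hypothetical scale-up of a GPU MachineSet; names and counts are assumptions.
from kubernetes import client, config

MACHINESET = "mycluster-gpu-g6-48xlarge-us-east-1a"  # placeholder MachineSet name
REPLICAS = 2

config.load_kube_config()
api = client.CustomObjectsApi()

# Patch the replica count; the machine-api controller then provisions the
# G6.48xlarge workers on which the GPU operator intermittently fails to validate.
api.patch_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machinesets",
    name=MACHINESET,
    body={"spec": {"replicas": REPLICAS}},
)
```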
Actual results:
The GPU operator fails about 1 in 5 times with the error "Unable to connect to server: dial tcp <some redacted gateway ip>: i/o timeout", followed on the next line by "Unable to get value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' label".
Expected results:
The node starts properly and AI workloads are able to access the GPU.
Additional info:
Rebooting the node always seems to resolve the issue.