- Bug
- Resolution: Duplicate
- Major
- None
- 4.12.0
- Important
- No
- SDN Sprint 250
- 1
- False
- Large cluster launching 128 pods all at once; the workaround is to set the PyTorch jobs to restart
ENV:
Cluster Version: 4.12.30
Infrastructure
--------------
Platform: IBMCloud
Install Type: UPI
Network
-------
Network Type: OVNKubernetes
Description of problem:
The OpenShift cluster is large, with more than 200 worker nodes. When we launch PyTorchJobs at a scale of 128 pods, we often have a single pod fail with error messages such as "CNI request failed" and "timed out waiting for annotations: context deadline exceeded".
The failure is not 100% reproducible. After multiple PyTorchJob launch attempts, we usually do get all 128 pods launched, but the repeated attempts cost cluster users significant time.
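The workaround noted above (configuring the PyTorch jobs to restart) can be expressed in the PyTorchJob spec by setting `restartPolicy: OnFailure` on the worker replicas, so a pod that hits the CNI annotation timeout is recreated automatically instead of failing the whole job. A minimal sketch, assuming the Kubeflow training operator `kubeflow.org/v1` API; the job name, image, and replica count are placeholders, not values from this cluster:

~~~
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-pytorchjob   # hypothetical name for illustration
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 128
      # OnFailure lets the operator recreate a worker pod whose sandbox
      # creation failed, rather than leaving the job stuck.
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/pytorch:latest   # placeholder image
~~~

This only masks the underlying CNI/annotation timeout; the restarted pod simply retries sandbox creation and usually succeeds on a later attempt.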
~~~
Warning FailedCreatePodSandBox 85s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_rp-granite-code-8b-4k-l1-r1-worker-48_granite-prod_9e1ea9da-42aa-4bd4-a34a-6381a96e6590_0(47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f): error adding pod granite-prod_rp-granite-code-8b-4k-l1-r1-worker-48 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48/9e1ea9da-42aa-4bd4-a34a-6381a96e6590:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48 47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f] [granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48 47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
~~~