- Bug
- Resolution: Duplicate
- Major
- None
- 4.12.0
- Important
- No
- SDN Sprint 250
- 1
- False
- Large cluster launching 128 pods all at once; the workaround is to set the PyTorch jobs to restart
ENV:
Cluster Version: 4.12.30
Infrastructure
--------------
Platform: IBMCloud
Install Type: UPI
Network
-------
Network Type: OVNKubernetes
Description of problem:
The OpenShift cluster is large, with more than 200 worker nodes. When we launch PyTorchJobs at a scale of 128 pods, we often have a single pod fail with error messages such as "CNI request failed" and "timed out waiting for annotations: context deadline exceeded".
The failure is not 100% reproducible. After multiple PyTorchJob launch attempts, we usually do get all 128 pods launched, but the repeated attempts cost cluster users significant time.
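The workaround noted above (configuring the PyTorch jobs to restart) can be expressed in the PyTorchJob spec by setting `restartPolicy: OnFailure` on the worker replicas, so a pod that hits the CNI annotation timeout is recreated automatically instead of failing the whole job. A minimal sketch, assuming the Kubeflow training operator `kubeflow.org/v1` API; the job name, image, and replica count are placeholders, not values from this cluster:

~~~
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-pytorchjob   # hypothetical name for illustration
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 128
      # OnFailure lets the operator recreate a worker pod whose sandbox
      # creation failed, rather than leaving the job stuck.
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/pytorch:latest   # placeholder image
~~~

This only masks the underlying CNI/annotation timeout; the restarted pod simply retries sandbox creation and usually succeeds on a later attempt.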
~~~
Warning FailedCreatePodSandBox 85s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_rp-granite-code-8b-4k-l1-r1-worker-48_granite-prod_9e1ea9da-42aa-4bd4-a34a-6381a96e6590_0(47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f): error adding pod granite-prod_rp-granite-code-8b-4k-l1-r1-worker-48 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48/9e1ea9da-42aa-4bd4-a34a-6381a96e6590:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48 47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f] [granite-prod/rp-granite-code-8b-4k-l1-r1-worker-48 47fe6bb8e447ef3175c9f00043a0f76478028cc23a26fd7f72864235f524ec3f] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
~~~