OpenShift Bugs / OCPBUGS-37685

error adding container to network "ovn-kubernetes": CNI request failed with status 400 - reopen



      Description of problem:

      Some pods fail to start after a node (worker) is rebooted; they remain in ContainerCreating status, and the journal logs on the node show OVN-related errors.
      

      Version-Release number of selected component (if applicable):

      Observed since OCP 4.14.29 and in the latest nightlies as of today (2024 Jul 29), e.g. OpenShift 4.14 nightly 2024-07-28 20:04.
      

       
      How reproducible:

      80% of the time after rebooting a node (worker); tested with 5 jobs, 4 out of 5 failed.
      

      Steps to Reproduce:

      1. Prepare an NMState manifest to use dual-stack addressing through DHCP for the LACP bond0 (br-ex) and bond0.vlanY (secondary bridge br-ex1)
      2. Deploy OCP 4.14 via IPI with the latest nightly on a baremetal cluster with OVN-K, passing the NMState configuration in install-config.yaml as day-1 configuration (dedicated worker nodes)
      3. After the cluster is ready, apply a Performance Profile
      4. Create a basic application with a Deployment, and reboot a worker node
      5. After the worker is back, some pods remain in ContainerCreating while others work fine
      6. Check the journal logs of the worker and look for errors such as *error adding container to network "ovn-kubernetes": CNI request failed with status 400* (a sketch of this check is shown below)
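
      A hedged sketch of the journal check in step 6; the node name is the worker from this report, and the grep pattern is simply the error string we were looking for:

      $ oc debug node/worker-1 -- chroot /host journalctl --no-pager -u kubelet -u crio \
          | grep 'CNI request failed with status 400'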
      

      Actual results:

      Some pods on the rebooted worker remain in ContainerCreating status
      

      Expected results:

      All pods on the rebooted worker should be "Running"
      

      Affected Platforms:

      Only tested on baremetal deployments with IPI and OVN-Kubernetes
      

      Additional info:

      Restarting the ovnkube-node-* pod on the rebooted worker, or the failed pods themselves, has no effect; they remain in the same status (the commands we used are sketched below).
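
      For reference, the restarts we attempted look roughly like this; the ovnkube-node pod name is a placeholder to fill in from the first command, and the application pod name is one of the stuck pods from the listing further below:

      $ oc -n openshift-ovn-kubernetes get pods -o wide --field-selector spec.nodeName=worker-1
      $ oc -n openshift-ovn-kubernetes delete pod <ovnkube-node-xxxxx>
      $ oc -n spk-data delete pod f5-tmm-7958c97f7f-h2sfc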
      

      More details:

      A few months ago we observed a similar error in earlier versions of OCP 4.14 and opened OCPBUGS-33721. That issue was fixed; however, the error is now back and the previous workaround no longer helps.

      We noticed some pods not running, and all of them are on the same worker node:

      $ oc get pods -A -o wide| grep -Eiv "running|complete"
      NAMESPACE                                          NAME                                                              READY   STATUS              RESTARTS        AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      openshift-operator-lifecycle-manager               collect-profiles-28704075-zjnjz                                   0/1     ContainerCreating   0              28s     <none>         worker-1   <none>           <none>
      spk-data                                           f5-tmm-7958c97f7f-h2sfc                                           0/3     ContainerCreating   0              5m44s   <none>         worker-1   <none>           <none>
      spk-utilities                                      spk-utilities-f5-dssm-db-0                                        0/3     Terminating         0              48m     <none>         worker-1   <none>           <none>
      spk-utilities                                      spk-utilities-f5-dssm-sentinel-0                                  0/3     ContainerCreating   0              7m59s   <none>         worker-1   <none>           <none>
      trident                                            trident-node-linux-7pnm8                                          0/2     CrashLoopBackOff    7 (116s ago)   57m     172.21.22.25   worker-1   <none>           <none>
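
      The same failure is visible in the events of each stuck pod; a hedged example using one of the pods above:

      $ oc -n spk-data describe pod f5-tmm-7958c97f7f-h2sfc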
      

      In the pod events and journal logs of the rebooted worker we could see messages like this:

      Jul 29 09:13:58 worker-1 kubenswrapper[7679]: E0729 09:13:58.819416    7679 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err=<
      Jul 29 09:13:58 worker-1 kubenswrapper[7679]:   rpc error: code = Unknown desc = failed to create pod network sandbox k8s_f5-tmm-7958c97f7f-h2sfc_spk-data_4065d307-a42f-4411-907a-2e1167737f7a_0(7b00c34aa82478a93fd5bb56004646b76125e1d3cf22
      abcf34f506ee515db546): error adding pod spk-data_f5-tmm-7958c97f7f-h2sfc to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{Containe
      rID:7b00c34aa82478a93fd5bb56004646b76125e1d3cf22abcf34f506ee515db546 Netns:/var/run/netns/a7f3dc11-be94-4017-902e-ccb1ef9bb2aa IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=spk-data;K8S_POD_NAME=f5-tmm-7958c97f7f-h2sfc;K8S_POD_INFRA_
      CONTAINER_ID=7b00c34aa82478a93fd5bb56004646b76125e1d3cf22abcf34f506ee515db546;K8S_POD_UID=4065d307-a42f-4411-907a-2e1167737f7a Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 
      34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 1
      10 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 4
      4 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 10
      8 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 11
      1 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 
      105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 10
      9 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 
      105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 1
      05 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 1
      11 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"7b00c34aa82478
      a93fd5bb56004646b76125e1d3cf22abcf34f506ee515db546" Netns:"/var/run/netns/a7f3dc11-be94-4017-902e-ccb1ef9bb2aa" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=spk-data;K8S_POD_NAME=f5-tmm-7958c97f7f-h2sfc;K8S_POD_INFRA_CONTAINER_ID
      =7b00c34aa82478a93fd5bb56004646b76125e1d3cf22abcf34f506ee515db546;K8S_POD_UID=4065d307-a42f-4411-907a-2e1167737f7a" Path:"" ERRORED: error configuring pod [spk-data/f5-tmm-7958c97f7f-h2sfc] networking: [spk-data/f5-tmm-7958c97f7f-h2sfc/40
      65d307-a42f-4411-907a-2e1167737f7a:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-data/f5-tmm-7958c97f7f-h2sfc 7b00c34aa82478a93fd5bb56004646b76125e1d3cf22abcf34f506ee515db54
      6 network default NAD default] [spk-data/f5-tmm-7958c97f7f-h2sfc 7b00c34aa82478a93fd5bb56004646b76125e1d3cf22abcf34f506ee515db546 network default NAD default] failed to get pod annotation: timed out waiting for annotations: context deadli
      ne exceeded
      Jul 29 09:13:58 worker-1 kubenswrapper[7679]:   '
      Jul 29 09:13:58 worker-1 kubenswrapper[7679]:   '
      Jul 29 09:13:58 worker-1 kubenswrapper[7679]:  >
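
      The "timed out waiting for annotations" part suggests that the k8s.ovn.org/pod-networks annotation (the one OVN-Kubernetes writes when pod networking is set up) never appears on the pod. A hedged way to confirm, using the stuck pod above; empty output would mean the annotation is missing:

      $ oc -n spk-data get pod f5-tmm-7958c97f7f-h2sfc \
          -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}'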
      
