Resolution: Duplicate
Description of problem:
In 4.11, I opened bug https://bugzilla.redhat.com/show_bug.cgi?id=2084062, "[4.11][OVN]Pod stuck in ContainerCreating: failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)". That bug occurred with 120 worker nodes. After investigation, the developers thought it was related to the large number of worker nodes on OVN. When I reduced the worker node count from 120 to 70, the issue no longer occurred on AWS or Azure, and a release note was added for 4.11 (https://bugzilla.redhat.com/show_bug.cgi?id=2084062#c63). In 4.12, I tested with 65 nodes on IBM Public Cloud and the issue occurs again.
Version-Release number of selected component (if applicable):
How reproducible:
Not reproducible on a 65-node AWS OVN cluster. Reproducible on a 65-node IBM Cloud cluster, but only intermittently.
Steps to Reproduce:
1. Install an IBM Public Cloud cluster with the OVN network. vm_type_masters: 'bx2-8x32' vm_type_workers: 'bx2-4x16'
2. Scale up the cluster to 65 worker nodes.
3. Install 3 INFRA nodes and move ingress to the INFRA nodes.
4. Run the router-perf test, which creates 500x4 pods/routes/services.
Actual results:
Some test pods were stuck in ContainerCreating for over 2 hours and did not recover. Newly created pods were stuck in ContainerCreating too. Describing a ContainerCreating pod showed the following event:
Warning FailedCreatePodSandBox 95s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_http-perf-99-676d99cdfc-gvxbs_http-scale-reencrypt_521a40f0-950b-4ec4-9b13-47b7d983ae3e_0(73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7): error adding pod http-scale-reencrypt_http-perf-99-676d99cdfc-gvxbs to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs/521a40f0-950b-4ec4-9b13-47b7d983ae3e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:08:1f []
Expected results:
Pods should be created successfully.
Additional info:
I checked all test pods/routes/services: the 500 pods in the last namespace were not Running, and many 'timed out waiting for OVS port binding' events were seen.
% Check all running test pods/services/endpoints were successfully created and the events of timed out
for termination in http edge passthrough reencrypt; do echo pods in http-scale-${termination}; oc get pods -n http-scale-${termination}| grep Running| wc -l; echo services in http-scale-${termination}; oc get services --no-headers -n http-scale-${termination} | wc -l; echo endpoints in http-scale-${termination}; oc get endpoints --no-headers -n http-scale-${termination} | wc -l; echo ovsportbinding_timoutout_events; oc get events -n http-scale-${termination} | grep 'timed out waiting for OVS port binding' | wc -l; done
zsh: command not found: Check
pods in http-scale-http
500
services in http-scale-http
500
endpoints in http-scale-http
500
ovsportbinding_timoutout_events
0
pods in http-scale-edge
500
services in http-scale-edge
500
endpoints in http-scale-edge
500
ovsportbinding_timoutout_events
198
pods in http-scale-passthrough
500
services in http-scale-passthrough
500
endpoints in http-scale-passthrough
500
ovsportbinding_timoutout_events
722
pods in http-scale-reencrypt
0
services in http-scale-reencrypt
500
endpoints in http-scale-reencrypt
500
ovsportbinding_timoutout_events
5000
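For anyone re-running this check, the one-liner above can be sketched as a small set of shell functions. This is only an illustrative rewrite, not the script used for the original test: the function names (count_running, count_binding_timeouts, summarize_router_perf) are hypothetical, and it assumes a logged-in `oc` and the four http-scale-* namespaces created by router-perf. The counting helpers match on the STATUS column rather than grepping for the word "Running" anywhere on the line.

```shell
# Count pods whose STATUS column is "Running".
# Input: output of `oc get pods --no-headers -n <namespace>` on stdin.
count_running() {
  awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Count event lines that mention the OVS port binding timeout.
# Input: output of `oc get events -n <namespace>` on stdin.
count_binding_timeouts() {
  awk '/timed out waiting for OVS port binding/ { n++ } END { print n + 0 }'
}

# Print a one-line summary for each router-perf namespace.
# Requires a logged-in `oc`; nothing runs until this function is called.
summarize_router_perf() {
  for termination in http edge passthrough reencrypt; do
    ns="http-scale-${termination}"
    printf '%s pods_running=%s services=%s endpoints=%s binding_timeouts=%s\n' \
      "$ns" \
      "$(oc get pods --no-headers -n "$ns" | count_running)" \
      "$(oc get services --no-headers -n "$ns" | awk 'END { print NR }')" \
      "$(oc get endpoints --no-headers -n "$ns" | awk 'END { print NR }')" \
      "$(oc get events -n "$ns" | count_binding_timeouts)"
  done
}
```

After sourcing the functions, calling summarize_router_perf prints one line per namespace, which makes it easier to spot a namespace such as http-scale-reencrypt where no pods are Running.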
Test pods were stuck in ContainerCreating for over 2 hours and did not recover; newly created pods were stuck in ContainerCreating too.
http-scale-reencrypt http-perf-95-5df66ddf9c-4gb8n 0/1 ContainerCreating 0 121m
http-scale-reencrypt http-perf-96-5597889b4-2sxwh 0/1 ContainerCreating 0 120m
http-scale-reencrypt http-perf-97-6ccffcb8dc-xh9vh 0/1 ContainerCreating 0 121m
http-scale-reencrypt http-perf-98-55c68557b6-69vtp 0/1 ContainerCreating 0 121m
http-scale-reencrypt http-perf-99-676d99cdfc-gvxbs 0/1 ContainerCreating 0 121m
openshift-marketplace certified-operators-77fxk 0/1 ContainerCreating 0 114m
openshift-marketplace community-operators-6c7p9 0/1 ContainerCreating 0 119m
openshift-marketplace redhat-marketplace-2xjk9 0/1 ContainerCreating 0 114m
openshift-marketplace redhat-operators-wwbrp 0/1 ContainerCreating 0 119m
openshift-operator-lifecycle-manager collect-profiles-27801000-j68nj 0/1 ContainerCreating 0 11m
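A listing like the one above can be produced by filtering `oc get pods -A --no-headers` on the STATUS and AGE columns. The following is a minimal sketch, not the command that generated the listing: the function name stuck_creating and the default 60-minute threshold are assumptions for illustration, and the age parser assumes ages made of d/h/m/s components such as 95s, 121m, or 3h5m.

```shell
# Print "namespace/name age" for pods that have been in ContainerCreating
# for at least the given number of minutes (default 60).
# Input: output of `oc get pods -A --no-headers` on stdin
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
stuck_creating() {
  awk -v min="${1:-60}" '
    # Convert an age such as 95s, 121m, 3h5m or 2d3h to minutes.
    function to_minutes(age,    total, num, unit) {
      total = 0
      while (match(age, /^[0-9]+[dhms]/)) {
        num  = substr(age, 1, RLENGTH - 1) + 0
        unit = substr(age, RLENGTH, 1)
        if (unit == "d") total += num * 1440
        else if (unit == "h") total += num * 60
        else if (unit == "m") total += num
        else total += num / 60
        age = substr(age, RLENGTH + 1)
      }
      return total
    }
    # AGE is always the last field, even when RESTARTS has extra words.
    $4 == "ContainerCreating" && to_minutes($NF) >= min {
      print $1 "/" $2, $NF
    }
  '
}
```

For example, `oc get pods -A --no-headers | stuck_creating 60` would list only the pods that have been stuck for an hour or more, which excludes short-lived ContainerCreating states during normal startup.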
Describing one of the ContainerCreating pods:
% oc describe po -n http-scale-reencrypt http-perf-99-676d99cdfc-gvxbs
Name:             http-perf-99-676d99cdfc-gvxbs
Namespace:        http-scale-reencrypt
Priority:         0
Service Account:  default
Node:             qili-ibm1107-cl7jf-worker-1-x87cm/
Start Time:       Thu, 10 Nov 2022 12:10:26 +0800
Labels:           app=nginx-99
                  pod-template-hash=676d99cdfc
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":[""],"mac_address":"0a:58:0a:81:08:1f","gateway_ips":[""],"ip_address":""...
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/http-perf-99-676d99cdfc
Containers:
  nginx:
    Container ID:
    Image:          quay.io/cloud-bulldozer/nginx:latest
    Image ID:
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wd8qt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-wd8qt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/worker=
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               19m   default-scheduler  Successfully assigned http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs to qili-ibm1107-cl7jf-worker-1-x87cm
  Warning  FailedCreatePodSandBox  17m   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_http-perf-99-676d99cdfc-gvxbs_http-scale-reencrypt_521a40f0-950b-4ec4-9b13-47b7d983ae3e_0(dd324ffe01c56a6df0f801b751848af6f07329df48f23bb542f7417d432a1db9): error adding pod http-scale-reencrypt_http-perf-99-676d99cdfc-gvxbs to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs/521a40f0-950b-4ec4-9b13-47b7d983ae3e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs dd324ffe01c56a6df0f801b751848af6f07329df48f23bb542f7417d432a1db9] [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs dd324ffe01c56a6df0f801b751848af6f07329df48f23bb542f7417d432a1db9] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:08:1f [] '
  ....
  Warning  FailedCreatePodSandBox  95s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_http-perf-99-676d99cdfc-gvxbs_http-scale-reencrypt_521a40f0-950b-4ec4-9b13-47b7d983ae3e_0(73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7): error adding pod http-scale-reencrypt_http-perf-99-676d99cdfc-gvxbs to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs/521a40f0-950b-4ec4-9b13-47b7d983ae3e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:08:1f [] '