OpenShift Bugs / OCPBUGS-3470

[OVN][IBM Cloud]Pod stuck in ContainerCreating on IBM Cloud with 65 workers when creating 2k pods/services/routes: failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • 4.12
    • Documentation

      Description of problem:

      In 4.11, I opened bug https://bugzilla.redhat.com/show_bug.cgi?id=2084062, "[4.11][OVN]Pod stuck in ContainerCreating: failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)". That bug occurred on a cluster with 120 worker nodes.
      After investigation, the developers thought it was related to the large number of worker nodes on OVN. When I reduced the worker node count from 120 to 70, the issue no longer occurred on AWS or Azure, and a release note was added to 4.11: https://bugzilla.redhat.com/show_bug.cgi?id=2084062#c63

      In 4.12, I tested with 65 nodes on IBM Public Cloud, and the issue occurs.

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2022-11-07-181244

      How reproducible:

      Not reproducible on a 65-node AWS OVN cluster.
      Reproducible on a 65-node IBM Cloud cluster, but only intermittently.

      Steps to Reproduce:

      1. Install an IBM Public Cloud cluster with the OVN network type, using vm_type_masters: 'bx2-8x32' and vm_type_workers: 'bx2-4x16'.
      2. Scale up the cluster to 65 worker nodes.
      3. Install 3 INFRA nodes and move ingress to the INFRA nodes.
      4. Run the router-perf test, which creates 500x4 pods/routes/services (500 in each of 4 namespaces).

      Actual results:

      Some test pods were stuck in ContainerCreating for over 2 hours and did not recover.
      Newly created pods got stuck in ContainerCreating too.
      Describing a ContainerCreating pod showed the following event:
      
      Warning  FailedCreatePodSandBox  95s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_http-perf-99-676d99cdfc-gvxbs_http-scale-reencrypt_521a40f0-950b-4ec4-9b13-47b7d983ae3e_0(73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7): error adding pod http-scale-reencrypt_http-perf-99-676d99cdfc-gvxbs to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs/521a40f0-950b-4ec4-9b13-47b7d983ae3e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:08:1f [10.129.8.31/23]

      Expected results:

      Pods should be created successfully

      Additional info:

      Checking all test pods/routes/services showed that none of the 500 pods in the last namespace (http-scale-reencrypt) were Running, and many 'timed out waiting for OVS port binding' events were seen.

      # Count the Running test pods, services, endpoints, and port-binding timeout events in each namespace
      % for termination in http edge passthrough reencrypt; do
          echo pods in http-scale-${termination}
          oc get pods -n http-scale-${termination} | grep Running | wc -l
          echo services in http-scale-${termination}
          oc get services --no-headers -n http-scale-${termination} | wc -l
          echo endpoints in http-scale-${termination}
          oc get endpoints --no-headers -n http-scale-${termination} | wc -l
          echo ovsportbinding_timoutout_events
          oc get events -n http-scale-${termination} | grep 'timed out waiting for OVS port binding' | wc -l
        done
      pods in http-scale-http
           500
      services in http-scale-http
           500
      endpoints in http-scale-http
           500
      ovsportbinding_timoutout_events
             0
      pods in http-scale-edge
           500
      services in http-scale-edge
           500
      endpoints in http-scale-edge
           500
      ovsportbinding_timoutout_events
           198
      pods in http-scale-passthrough
           500
      services in http-scale-passthrough
           500
      endpoints in http-scale-passthrough
           500
      ovsportbinding_timoutout_events
           722
      pods in http-scale-reencrypt
             0
      services in http-scale-reencrypt
           500
      endpoints in http-scale-reencrypt
           500
      ovsportbinding_timoutout_events
          5000

      The test pods stayed stuck in ContainerCreating for over 2 hours without recovering, and newly created pods, including pods in other namespaces, got stuck in ContainerCreating too:

      ...
      http-scale-reencrypt                               http-perf-95-5df66ddf9c-4gb8n                                0/1   ContainerCreating   0               121m
      http-scale-reencrypt                               http-perf-96-5597889b4-2sxwh                                 0/1   ContainerCreating   0               120m
      http-scale-reencrypt                               http-perf-97-6ccffcb8dc-xh9vh                                0/1   ContainerCreating   0               121m
      http-scale-reencrypt                               http-perf-98-55c68557b6-69vtp                                0/1   ContainerCreating   0               121m
      http-scale-reencrypt                               http-perf-99-676d99cdfc-gvxbs                                0/1   ContainerCreating   0               121m
      openshift-marketplace                              certified-operators-77fxk                                    0/1   ContainerCreating   0               114m
      openshift-marketplace                              community-operators-6c7p9                                    0/1   ContainerCreating   0               119m
      openshift-marketplace                              redhat-marketplace-2xjk9                                     0/1   ContainerCreating   0               114m
      openshift-marketplace                              redhat-operators-wwbrp                                       0/1   ContainerCreating   0               119m
      openshift-operator-lifecycle-manager               collect-profiles-27801000-j68nj                              0/1   ContainerCreating   0               11m
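To see at a glance how far the problem has spread beyond the test namespaces, a listing like the one above can be summarised per namespace. The following is a minimal sketch, not part of the original report: the function name `count_creating_by_namespace` is hypothetical, and it assumes the default `oc get pods -A --no-headers` column order (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: count pods stuck in ContainerCreating per namespace.
# Reads `oc get pods -A --no-headers` output on stdin and assumes the
# default column order, where STATUS is the fourth field.
count_creating_by_namespace() {
  awk '$4 == "ContainerCreating" { n[$1]++ }
       END { for (ns in n) print n[ns], ns }' |
    sort -k1,1nr -k2,2   # largest count first, then namespace name
}
```

On a live cluster this would be used as `oc get pods -A --no-headers | count_creating_by_namespace`.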
      

      Describing one of the ContainerCreating pods:

      % oc describe po -n http-scale-reencrypt http-perf-99-676d99cdfc-gvxbs 
      Name:             http-perf-99-676d99cdfc-gvxbs
      Namespace:        http-scale-reencrypt
      Priority:         0
      Service Account:  default
      Node:             qili-ibm1107-cl7jf-worker-1-x87cm/10.241.0.18
      Start Time:       Thu, 10 Nov 2022 12:10:26 +0800
      Labels:           app=nginx-99
                        pod-template-hash=676d99cdfc
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["10.129.8.31/23"],"mac_address":"0a:58:0a:81:08:1f","gateway_ips":["10.129.8.1"],"ip_address":"10.129.8.31/23"...
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Pending
      IP:               
      IPs:              <none>
      Controlled By:    ReplicaSet/http-perf-99-676d99cdfc
      Containers:
        nginx:
          Container ID:   
          Image:          quay.io/cloud-bulldozer/nginx:latest
          Image ID:       
          Port:           8080/TCP
          Host Port:      0/TCP
          State:          Waiting
            Reason:       ContainerCreating
          Ready:          False
          Restart Count:  0
          Requests:
            cpu:        10m
            memory:     10Mi
          Environment:  <none>
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wd8qt (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             False 
        ContainersReady   False 
        PodScheduled      True 
      Volumes:
        kube-api-access-wd8qt:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              node-role.kubernetes.io/worker=
      Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason                  Age   From               Message
        ----     ------                  ----  ----               -------
        Normal   Scheduled               19m   default-scheduler  Successfully assigned http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs to qili-ibm1107-cl7jf-worker-1-x87cm
        Warning  FailedCreatePodSandBox  17m   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_http-perf-99-676d99cdfc-gvxbs_http-scale-reencrypt_521a40f0-950b-4ec4-9b13-47b7d983ae3e_0(dd324ffe01c56a6df0f801b751848af6f07329df48f23bb542f7417d432a1db9): error adding pod http-scale-reencrypt_http-perf-99-676d99cdfc-gvxbs to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs/521a40f0-950b-4ec4-9b13-47b7d983ae3e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs dd324ffe01c56a6df0f801b751848af6f07329df48f23bb542f7417d432a1db9] [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs dd324ffe01c56a6df0f801b751848af6f07329df48f23bb542f7417d432a1db9] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:08:1f [10.129.8.31/23]
      '
      ....
        Warning  FailedCreatePodSandBox  95s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_http-perf-99-676d99cdfc-gvxbs_http-scale-reencrypt_521a40f0-950b-4ec4-9b13-47b7d983ae3e_0(73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7): error adding pod http-scale-reencrypt_http-perf-99-676d99cdfc-gvxbs to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs/521a40f0-950b-4ec4-9b13-47b7d983ae3e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] [http-scale-reencrypt/http-perf-99-676d99cdfc-gvxbs 73f6833ebf6cc2d457dd6542febf06d992c4af025d435f6cca3393cf716f29e7] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:08:1f [10.129.8.31/23]
      '
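When triaging events like these, it can help to pull out the MAC/IP pairs that timed out and then check the matching OVS interface on the node. The sketch below is illustrative only: the function names are hypothetical, and the `iface-id=<namespace>_<pod>` naming is an assumption about how OVN-Kubernetes labels pod interfaces that should be verified on the release under test.

```shell
# Hypothetical helper: print the unique "MAC IP/prefix" pairs named in
# port-binding timeout events. Reads event text (for example from
# `oc get events` or `oc describe pod`) on stdin.
parse_port_binding_timeouts() {
  sed -n 's/.*timed out waiting for OVS port binding (ovn-installed) for \([0-9a-f:]*\) \[\([^]]*\)].*/\1 \2/p' |
    sort -u
}

# ASSUMPTION: the pod's OVS interface carries
# external_ids:iface-id=<namespace>_<pod>; verify this on your cluster.
ovs_iface_id() {
  printf '%s_%s\n' "$1" "$2"
}

# Example lookup on the affected node (needs cluster access, not run here):
#   oc debug node/<node> -- chroot /host \
#     ovs-vsctl --columns=name,external_ids find Interface \
#     "external_ids:iface-id=$(ovs_iface_id <namespace> <pod>)"
# If the interface exists but its external_ids lack ovn-installed, that
# would be consistent with the timeout reported in the events.
```

For the pod described above, `ovs_iface_id http-scale-reencrypt http-perf-99-676d99cdfc-gvxbs` gives the interface ID to search for under that assumption.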
      
      

       

              lmurthy Latha Sreenivasa Murthy
              rhn-support-qili Qiujie Li
              Anurag Saxena Anurag Saxena