OpenShift Bugs / OCPBUGS-30899

error adding container to network "ovn-kubernetes": CNI request failed with status 400

    • Release Note Not Required
    • In Progress

      Description of problem:

      Pods fail to start: they remain in ContainerCreating status, and the journal logs on the affected node show OVN-Kubernetes CNI errors.
      

      Version-Release number of selected component (if applicable):

      OCP 4.14.16 and nightlies after OpenShift 4.14 nightly 2024-03-08 18:06
      

       
      How reproducible:

          Intermittently. So far only one node in the cluster shows this behaviour at a time (not always the same node).
      

      Steps to Reproduce:

      1. Prepare an NMState manifest that uses dual-stack through DHCP for the LACP bond0 (br-ex) and bond0.vlanY (secondary bridge br-ex1).
      2. Deploy OCP 4.14 via IPI with the latest nightly GA on a bare-metal cluster with OVN-K, passing the NMState configuration in install-config.yaml as day 1 (dedicated worker nodes).
      3. After the cluster is ready, apply a Performance Profile.
      4. Create a basic application with a Deployment of 3 replicas and check the pods. Sometimes a pod remains in ContainerCreating, and most of the other pods on that node are in the same status.
      5. Check the journal logs of the worker and look for errors such as *error adding container to network "ovn-kubernetes": CNI request failed with status 400*
      

      Actual results:

      New pods cannot start on one of the worker nodes. On a seemingly random worker, they remain in ContainerCreating status.
      

      Expected results:

      Pods should start on any worker and reach "Running" status.
      

      Affected Platforms:

      Only tested on bare-metal deployments with IPI and OVN-Kubernetes.
      

      Additional info:

      If we restart the ovnkube-node-* pod on that worker (delete the pod so it gets recreated), the stuck pods are created and marked as Running, and the errors in the journal disappear.
      

      More details:

      We noticed several pods not running, all of them on the same worker node:

      $ oc get pods -A -o wide| grep -Eiv "running|complete"
      NAMESPACE                                          NAME                                                              READY   STATUS              RESTARTS        AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      myns                                               webserver-6dc5cb556d-5pb9g                                        0/1     ContainerCreating   0               49m     <none>          worker-2   <none>           <none>
      openshift-logging                                  cluster-logging-operator-666468c794-snd77                         0/1     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      openshift-monitoring                               thanos-querier-647c9db798-tbtjk                                   0/6     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-data                                           f5-tmm-557bd77784-qvdww                                           0/3     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-dns46                                          f5-tmm-78d4fbc46d-shxrs                                           0/3     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-test                                           f5-hello-world-74d48dc4c6-689jp                                   0/1     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-utilities                                      f5-cert-manager-webhook-6674ddd499-bzpb2                          0/1     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-utilities                                      f5-rabbit-565d9cc79d-fjl4s                                        0/1     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-utilities                                      f5-spk-cwc-7b44fbbcdf-tksxx                                       0/2     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-utilities                                      spk-utilities-f5-dssm-db-1                                        0/3     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      spk-utilities                                      spk-utilities-f5-dssm-sentinel-0                                  0/3     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
      trident                                            trident-controller-86867589c8-bl2wt                               0/6     ContainerCreating   0               10m     <none>          worker-2   <none>           <none>
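
      The listing above can be summarised per node to confirm that every stuck pod sits on a single worker. This is a minimal sketch, not part of the original report: `count_stuck_by_node` is a hypothetical helper name, and it assumes the `-o wide` column layout shown above, where NODE is the third field from the end of each row.

```shell
# Hypothetical helper: count ContainerCreating pods per node.
# Reads `oc get pods -A -o wide` output on stdin.
# NODE is taken as the third field from the end, because the RESTARTS
# column can contain spaces (for example "9 (3h14m ago)").
count_stuck_by_node() {
  awk '$1 != "NAMESPACE" && $4 == "ContainerCreating" { n[$(NF-2)]++ }
       END { for (node in n) print node, n[node] }'
}

# Usage (needs cluster access):
#   oc get pods -A -o wide | count_stuck_by_node
```

      If more than one node appears in the output, the problem is probably not confined to a single ovnkube-node instance.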
      

      In the journal log of the worker we could see messages like this:

       Warning  FailedCreatePodSandBox  2m10s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webserver-6dc5cb556d-5pb9g_myns_acf475b2-3b7b-4861-9bc8-9c2b14285b85_0(fccefe3b404b
      c01d645602d2c55283396f7c854cb9abedbed4bf75c9886b9601): error adding pod myns_webserver-6dc5cb556d-5pb9g to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request fail
      ed with status 400: '&{ContainerID:fccefe3b404bc01d645602d2c55283396f7c854cb9abedbed4bf75c9886b9601 Netns:/var/run/netns/6c2cd468-28b9-42cf-b8b8-33719c353888 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=myns;K8S_POD_NAME=webserver-6
      dc5cb556d-5pb9g;K8S_POD_INFRA_CONTAINER_ID=fccefe3b404bc01d645602d2c55283396f7c854cb9abedbed4bf75c9886b9601;K8S_POD_UID=acf475b2-3b7b-4861-9bc8-9c2b14285b85 Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 9
      8 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 
      116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 
      105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116
       34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110
       101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 $00 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 1$3 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 $
      16 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 12$]} ContainerID:"fccefe3b404bc01d645602d2c55283396f7c854cb9abedbed4bf75c9886b9601" Netns:"/var/run/netns/6c2cd468-28b9-42cf-b8b8-33719c353888" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=myns;K8S_POD_NAME=webserver-6dc5cb556d-5p$9g;K8S_POD_INFRA_CONTAINER_ID=fccefe3b404bc01d645602d2c55283396f7c854cb9abedbed4bf75c9886b9601;K8S_POD_UID=acf475b2-3b7b-4861-9bc8-9c2b14285b85" Path:"" ERRORED: error configuring pod [myns/webserver-6dc5cb556d-5pb9g] networking: [myns/w$bserver-6dc5cb556d-5pb9g/acf475b2-3b7b-4861-9bc8-9c2b14285b85:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[myns/webserver-6dc5cb556d-5pb9g fccefe3b404bc01d645602d2c55283396f7c$
      54cb9abedbed4bf75c9886b9601 network default NAD default] [myns/webserver-6dc5cb556d-5pb9g fccefe3b404bc01d645602d2c55283396f7c854cb9abedbed4bf75c9886b9601 network default NAD default] failed to get pod annotation: timed out waiting for a$notations: context deadline exceeded
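
      The `StdinData:[123 34 98 ...]` field in the message above is the CNI configuration passed to the plugin, printed as decimal byte values. A small sketch for turning such a list back into text follows. `decode_stdin_bytes` is a hypothetical helper name, and tokens that are not plain numbers (such as the `$`-truncated ones in this paste) are skipped rather than guessed.

```shell
# Hypothetical helper: decode a space-separated list of decimal byte
# values (as printed in the StdinData field) into text.
decode_stdin_bytes() {
  for b in $1; do
    # Skip anything that is not a plain decimal number.
    case "$b" in
      ''|*[!0-9]*) continue ;;
    esac
    # %b interprets \0ddd as an octal escape.
    printf '%b' "\\0$(printf '%o' "$b")"
  done
}

# Usage with the first bytes of the StdinData field above:
decode_stdin_bytes "123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34"
# prints {"binDir":"/var/lib/cni/bin"
```

      The decoded text is the multus configuration only. The actual failure reason is the trailing part of the message: "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded".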
      

      The ovnkube-node pod is running on worker-2 and its logs (from the ovn-controller container, the default) show no issues. The br-ex and br-ex1 interfaces also look healthy (both have IPv4 and IPv6 addresses):

      $ oc -n openshift-ovn-kubernetes get pods -o wide
      NAME                                     READY   STATUS    RESTARTS        AGE     IP              NODE       NOMINATED NODE   READINESS GATES                                                                                               
      ovnkube-control-plane-588d654c6d-bdrjl   2/2     Running   0               3h      192.168.12.22   master-1   <none>           <none>                                                                                                        
      ovnkube-control-plane-588d654c6d-kgltz   2/2     Running   0               3h14m   192.168.12.23   master-2   <none>           <none>                                                                                                        
      ovnkube-node-4wlw9                       8/8     Running   16              3h49m   192.168.12.25   worker-1   <none>           <none>                                                                                                        
      ovnkube-node-786qv                       8/8     Running   9 (3h14m ago)   4h28m   192.168.12.23   master-2   <none>           <none>                                                                                                        
      ovnkube-node-7ltvf                       8/8     Running   9 (3h1m ago)    4h28m   192.168.12.22   master-1   <none>           <none>                                                                                                        
      ovnkube-node-dmhm2                       8/8     Running   16              3h51m   192.168.12.27   worker-3   <none>           <none>                                                                                                        
      ovnkube-node-phm6h                       8/8     Running   25              3h49m   192.168.12.24   worker-0   <none>           <none>                                                                                                        
      ovnkube-node-vsmdx                       8/8     Running   32              3h49m   192.168.12.26   worker-2   <none>           <none>                                                                                                        
      ovnkube-node-zxhqh                       8/8     Running   9 (168m ago)    4h28m   192.168.12.21   master-0   <none>           <none>                                                                                                        
      
      $ oc -n openshift-ovn-kubernetes logs ovnkube-node-vsmdx | tail 
      Defaulted container "ovn-controller" out of: ovn-controller, ovn-acl-logging, kube-rbac-proxy-node, kube-rbac-proxy-ovn-metrics, northd, nbdb, sbdb, ovnkube-controller, kubecfg-setup (init)
      2024-03-13T14:43:16.092Z|00064|binding|INFO|Setting lport openshift-network-diagnostics_network-check-target-9m5mg ovn-installed in OVS
      2024-03-13T14:43:16.092Z|00065|binding|INFO|Setting lport openshift-network-diagnostics_network-check-target-9m5mg up in Southbound
      2024-03-13T14:43:16.092Z|00066|binding|INFO|Setting lport openshift-dns_dns-default-2gs7f ovn-installed in OVS
      2024-03-13T14:43:16.092Z|00067|binding|INFO|Setting lport openshift-dns_dns-default-2gs7f up in Southbound
      2024-03-13T14:43:16.236Z|00068|binding|INFO|Claiming lport openshift-ingress-canary_ingress-canary-vmshk for this chassis.
      2024-03-13T14:43:16.236Z|00069|binding|INFO|openshift-ingress-canary_ingress-canary-vmshk: Claiming 0a:58:0a:80:02:07 10.128.2.7 fd02:0:0:5::7
      2024-03-13T14:43:16.237Z|00070|binding|INFO|Setting lport openshift-ingress-canary_ingress-canary-vmshk down in Southbound
      2024-03-13T14:43:16.248Z|00071|binding|INFO|Setting lport openshift-ingress-canary_ingress-canary-vmshk ovn-installed in OVS
      2024-03-13T14:43:16.248Z|00072|binding|INFO|Setting lport openshift-ingress-canary_ingress-canary-vmshk up in Southbound
      2024-03-13T14:43:47.006Z|00073|memory_trim|INFO|Detected inactivity (last active 30003 ms ago): trimming memory
      
      [core@worker-2 ~]$ ip a s br-ex
      23: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether b8:83:03:8e:0e:dc brd ff:ff:ff:ff:ff:ff
          inet 192.168.12.26/24 brd 192.168.12.255 scope global dynamic noprefixroute br-ex
             valid_lft 2716sec preferred_lft 2716sec
          inet 169.254.169.2/29 brd 169.254.169.7 scope global br-ex
             valid_lft forever preferred_lft forever
          inet6 fd69::2/125 scope global nodad 
             valid_lft forever preferred_lft forever
          inet6 fd1c:61fe:bdf1:12::1a/128 scope global dynamic noprefixroute 
             valid_lft 6128sec preferred_lft 6128sec
          inet6 fe80::ba83:3ff:fe8e:edc/64 scope link noprefixroute 
             valid_lft forever preferred_lft forever
      [core@worker-2 ~]$ ip a s br-ex1
      24: br-ex1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether b8:83:03:8e:0e:dc brd ff:ff:ff:ff:ff:ff
          inet 192.168.16.143/26 brd 192.168.16.191 scope global dynamic noprefixroute br-ex1
             valid_lft 2718sec preferred_lft 2718sec
          inet6 fd48:de67:5083:16::36/128 scope global dynamic noprefixroute 
             valid_lft 6209sec preferred_lft 6209sec
          inet6 fe80::ba83:3ff:fe8e:edc/64 scope link noprefixroute 
             valid_lft forever preferred_lft forever
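
      The dual-stack check on br-ex and br-ex1 was done by eye. A rough sketch of the same check as a script is below. `has_dual_stack` is a hypothetical helper name, and the check is coarse: it only looks for at least one `scope global` IPv4 address and one `scope global` IPv6 address in `ip a s` output, so link-local addresses do not count.

```shell
# Hypothetical helper: succeed if `ip a s <dev>` output on stdin has at
# least one global-scope IPv4 and one global-scope IPv6 address.
has_dual_stack() {
  awk '$1 == "inet"  && /scope global/ { v4 = 1 }
       $1 == "inet6" && /scope global/ { v6 = 1 }
       END { exit !(v4 && v6) }'
}

# Usage on the node:
#   ip a s br-ex  | has_dual_stack && echo "br-ex: dual-stack OK"
#   ip a s br-ex1 | has_dual_stack && echo "br-ex1: dual-stack OK"
```
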
      

      I deleted the ovnkube-node pod on worker-2 to see whether this clears the issue:

      $ oc -n openshift-ovn-kubernetes get pods -o wide | egrep "NAME|worker-2"
      NAME                                     READY   STATUS    RESTARTS        AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      ovnkube-node-vsmdx                       8/8     Running   32              3h59m   192.168.12.26   worker-2   <none>           <none>
      
      $ oc -n openshift-ovn-kubernetes delete pod ovnkube-node-vsmdx
      pod "ovnkube-node-vsmdx" deleted
      
      $ oc -n openshift-ovn-kubernetes get pods -o wide | egrep "NAME|worker-2"
      NAME                                     READY   STATUS    RESTARTS        AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      ovnkube-node-l6v76                       8/8     Running   0               36s     192.168.12.26   worker-2   <none>           <none>
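
      The manual workaround above can be wrapped so the pod name does not have to be looked up first. This is only a sketch: `restart_ovnkube_node` is a hypothetical helper name, and the `app=ovnkube-node` label selector is an assumption about how the DaemonSet pods are labelled, so verify it with `oc -n openshift-ovn-kubernetes get pods --show-labels` before relying on it.

```shell
# Hypothetical helper: delete the ovnkube-node pod on a given node so the
# DaemonSet recreates it.
# Assumption: ovnkube-node pods carry the label app=ovnkube-node.
restart_ovnkube_node() {
  node="$1"
  ns="openshift-ovn-kubernetes"
  # -o name prints entries such as pod/ovnkube-node-vsmdx
  pods=$(oc -n "$ns" get pods -l app=ovnkube-node \
           --field-selector "spec.nodeName=${node}" -o name)
  if [ -z "$pods" ]; then
    echo "no ovnkube-node pod found on ${node}" >&2
    return 1
  fi
  # $pods is intentionally unquoted so each entry becomes one argument.
  oc -n "$ns" delete $pods
}

# Usage (needs cluster access):
#   restart_ovnkube_node worker-2
```

      This only clears the symptom, as in the manual steps. It does not explain why the pod annotations timed out.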
      

      After waiting a while, all the previously stuck pods were running on worker-2 with no issues:

      $ oc get pods -A -o wide| grep -Eiv "running|complete"
      NAMESPACE                                          NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      [kni@provisioner.cluster1.dfwt5g.lab ~]$ oc get pods -A -o wide| grep worker-2
      kube-system                                        istio-cni-node-f266h                                              1/1     Running     2               3h1m    10.128.2.19     worker-2   <none>           <none>
      myns                                               webserver-6dc5cb556d-5pb9g                                        1/1     Running     0               84m     10.128.2.16     worker-2   <none>           <none>
      openshift-cluster-node-tuning-operator             tuned-85g6g                                                       1/1     Running     4               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-dns                                      dns-default-2gs7f                                                 2/2     Running     8               4h22m   10.128.2.6      worker-2   <none>           <none>
      openshift-dns                                      node-resolver-srqvf                                               1/1     Running     4               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-image-registry                           node-ca-pwlln                                                     1/1     Running     4               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-ingress-canary                           ingress-canary-vmshk                                              1/1     Running     4               4h22m   10.128.2.7      worker-2   <none>           <none>
      openshift-kni-infra                                coredns-worker-2                                                  2/2     Running     8               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-kni-infra                                keepalived-worker-2                                               2/2     Running     8               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-logging                                  cluster-logging-operator-666468c794-snd77                         1/1     Running     0               45m     10.128.2.14     worker-2   <none>           <none>
      openshift-machine-config-operator                  machine-config-daemon-c8xrh                                       2/2     Running     8               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-monitoring                               node-exporter-w9h6l                                               2/2     Running     8               4h21m   192.168.12.26   worker-2   <none>           <none>
      openshift-monitoring                               thanos-querier-647c9db798-tbtjk                                   6/6     Running     0               45m     10.128.2.10     worker-2   <none>           <none>
      openshift-multus                                   multus-additional-cni-plugins-hq884                               1/1     Running     4               4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-multus                                   multus-zrg4b                                                      1/1     Running     6 (22m ago)     4h23m   192.168.12.26   worker-2   <none>           <none>
      openshift-multus                                   network-metrics-daemon-52v4p                                      2/2     Running     8               4h23m   10.128.2.4      worker-2   <none>           <none>
      openshift-network-diagnostics                      network-check-target-9m5mg                                        1/1     Running     4               4h23m   10.128.2.3      worker-2   <none>           <none>
      openshift-ovn-kubernetes                           ovnkube-node-l6v76                                                8/8     Running     0               22m     192.168.12.26   worker-2   <none>           <none>
      openshift-sriov-network-operator                   sriov-device-plugin-w76nv                                         1/1     Running     0               40m     192.168.12.26   worker-2   <none>           <none>
      openshift-sriov-network-operator                   sriov-network-config-daemon-h6clk                                 1/1     Running     4               4h4m    192.168.12.26   worker-2   <none>           <none>
      spk-data                                           f5-tmm-557bd77784-qvdww                                           3/3     Running     0               45m     10.128.2.12     worker-2   <none>           <none>
      spk-dns46                                          f5-tmm-78d4fbc46d-shxrs                                           3/3     Running     0               45m     10.128.2.17     worker-2   <none>           <none>
      spk-test                                           f5-hello-world-74d48dc4c6-689jp                                   1/1     Running     0               45m     10.128.2.13     worker-2   <none>           <none>
      spk-utilities                                      f5-cert-manager-webhook-6674ddd499-bzpb2                          1/1     Running     0               45m     10.128.2.8      worker-2   <none>           <none>
      spk-utilities                                      f5-rabbit-565d9cc79d-fjl4s                                        1/1     Running     0               45m     10.128.2.15     worker-2   <none>           <none>
      spk-utilities                                      f5-spk-cwc-7b44fbbcdf-tksxx                                       2/2     Running     0               45m     10.128.2.5      worker-2   <none>           <none>
      spk-utilities                                      spk-utilities-f5-dssm-db-1                                        3/3     Running     0               45m     10.128.2.18     worker-2   <none>           <none>
      spk-utilities                                      spk-utilities-f5-dssm-sentinel-0                                  3/3     Running     0               45m     10.128.2.9      worker-2   <none>           <none>
      trident                                            trident-controller-86867589c8-bl2wt                               6/6     Running     0               45m     10.128.2.11     worker-2   <none>           <none>
      trident                                            trident-node-linux-6tl8z                                          2/2     Running     4               3h2m    192.168.12.26   worker-2   <none>           <none>
      

              jtanenba@redhat.com Jacob Tanenbaum
              rhn-gps-manrodri Manuel Rodriguez
              Dave Wilson Dave Wilson