OpenShift Bugs / OCPBUGS-37825

Pod probes failing under load conditions

    • Quality / Stability / Reliability

      During a load test on a bare-metal cluster, some pods got stuck in the CrashLoopBackOff state because their liveness and readiness probes failed.

      The pods in question are simple HTTP servers (nginx), and the probes are typical httpGet probes pointing to the endpoint /.
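To make the failure mode concrete, here is a minimal sketch of what an httpGet probe does: a GET with a short timeout, where any status from 200 to 399 counts as success. The function name `http_get_probe` is hypothetical, and the 1-second default timeout is the Kubernetes default for `timeoutSeconds`, not a value taken from the attached nginx.yml. It approximates the kubelet's behaviour and is not its implementation.

```python
import http.client
import urllib.error
import urllib.request


def http_get_probe(url: str, timeout_seconds: float = 1.0) -> bool:
    """Approximate an httpGet probe (hypothetical helper, not kubelet code).

    Returns True when the GET completes within the timeout with a status
    code from 200 to 399, and False on timeouts, refused connections, or
    any other status.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # HTTPError (4xx/5xx) is a URLError subclass; socket timeouts and
        # refused connections surface as OSError subclasses.
        return False
```

Run from a host that can reach the pod network, `http_get_probe("http://10.130.2.171:8080/")` would be expected to return False while the pod is in the state shown in the events below.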

      The load test was executed under the following conditions: a 4.14.27 cluster with 6 nodes (3 workers and 3 masters), maxPods configured to 500, and OVNKubernetes in its interconnect (IC) mode.

      The benchmark fills the worker nodes with pods and then runs a pod delete/create cycle. In the local environment I used to reproduce the case, the sequence was the following:

      • 1439 Deployments (quay.io/cloud-bulldozer/nginx:latest) were created, along with Services pointing to port 8080 of their pods
      • The pods from these Deployments are then deleted using `oc delete pod -A -l kube-burner=perf-tests`
      • The script waits for them to be up and running again, but this operation gets stuck after some cycles (sometimes during the first one)
      • Some pods fail to start because their probes fail due to network failures
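The delete/wait cycle described above can be sketched roughly as follows. This is not the actual benchmark script: the function names, polling interval, and timeout are assumptions made for illustration, and only the `oc delete pod -A -l kube-burner=perf-tests` command comes from this report.

```python
import json
import subprocess
import time

LABEL = "kube-burner=perf-tests"  # label selector taken from the report


def not_ready_pods(pod_list: dict) -> list[str]:
    """Return namespace/name for every pod whose Ready condition is not True."""
    pending = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            meta = pod["metadata"]
            pending.append(f"{meta['namespace']}/{meta['name']}")
    return pending


def run_cycle(poll_seconds: float = 10.0, timeout_seconds: float = 1800.0) -> list[str]:
    """Delete the labelled pods once, then poll until the replacements are Ready.

    Returns the pods still not Ready at the deadline (empty on success).
    Assumption: the Deployments have already recreated their pods by the
    first poll; a stricter check would also compare against the expected
    pod count.
    """
    subprocess.run(["oc", "delete", "pod", "-A", "-l", LABEL], check=True)
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = subprocess.run(
            ["oc", "get", "pod", "-A", "-l", LABEL, "-o", "json"],
            check=True, capture_output=True, text=True,
        )
        pending = not_ready_pods(json.loads(result.stdout))
        if not pending or time.monotonic() >= deadline:
            return pending
        time.sleep(poll_seconds)
```

When the issue reproduces, `run_cycle()` would return a non-empty list at the deadline instead of completing, which corresponds to the "stuck" behaviour in the third bullet.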

      This issue is impacting one of our customers; more information is available at https://access.redhat.com/support/cases/#/case/03868814

       

      Attaching some traces below:

       

      # Some pods are in CrashLoopBackOff state
      [root@m42-h01-000-r760 rsevilla]# oc get pod -A -o wide | grep -i crash
      ichp-kubelet-density-1258                          nginx-1-58d54644f9-b42m2                                     0/1     CrashLoopBackOff   23 (57s ago)   64m     10.130.2.228     m42-h15-000-r760   <none>           <none>
      ichp-kubelet-density-379                           nginx-1-58d54644f9-b5l5t                                     0/1     CrashLoopBackOff   23 (21s ago)   63m     10.130.2.171     m42-h15-000-r760   <none>           <none>
      ichp-kubelet-density-43                            nginx-1-58d54644f9-8wsnn                                     0/1     CrashLoopBackOff   23 (29s ago)   63m     10.130.2.197     m42-h15-000-r760   <none>           <none>
      ichp-kubelet-density-748                           nginx-1-58d54644f9-l6fl7                                     0/1     CrashLoopBackOff   23 (33s ago)   63m     10.128.2.66      m42-h19-000-r760   <none>           <none>
      ichp-kubelet-density-870                           nginx-1-58d54644f9-j29ln                                     0/1     CrashLoopBackOff   23 (51s ago)   63m     10.128.2.183     m42-h19-000-r760   <none>           <none>
      
      
      # Events from one of the pods
      
      # oc describe pod -n ichp-kubelet-density-379 nginx-1-58d54644f9-b5l5t
      Events:
        Type     Reason                  Age                   From               Message
        ----     ------                  ----                  ----               -------
        Normal   Scheduled               64m                   default-scheduler  Successfully assigned ichp-kubelet-density-379/nginx-1-58d54644f9-b5l5t to m42-h15-000-r760                                                                      
        Warning  FailedCreatePodSandBox  63m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379_75905e1a-8d46-4785-ae8f-07927c6d7571_0(42fd1c8f921528b7316a349cea255d8d03df00eacd742d3343c79b28fd366324): error adding pod ichp-kubelet-density-379_nginx-1-58d54644f9-b5l5t to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF
        Warning  FailedCreatePodSandBox  63m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379_75905e1a-8d46-4785-ae8f-07927c6d7571_0(4aca9932d378a4c0515722a25f19cc5be2ff1b241daf07c612ec127c05d1324b): error adding pod ichp-kubelet-density-379_nginx-1-58d54644f9-b5l5t to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/socket/multus.sock: connect: no such file or directory                                                               
        Normal   AddedInterface          63m                   multus             Add eth0 [10.130.2.171/23] from ovn-kubernetes                                                                                                                   
        Warning  Unhealthy               61m (x3 over 62m)     kubelet            Readiness probe failed: Get "http://10.130.2.171:8080/": dial tcp 10.130.2.171:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)                
        Warning  Unhealthy               61m (x6 over 62m)     kubelet            Liveness probe failed: Get "http://10.130.2.171:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)                               
        Normal   Killing                 61m                   kubelet            Container netty failed liveness probe, will be restarted                                                                                                         
        Normal   Pulled                  61m (x2 over 62m)     kubelet            Container image "quay.io/cloud-bulldozer/nginx:latest" already present on machine                                                                                
        Normal   Created                 61m (x2 over 62m)     kubelet            Created container netty
        Normal   Started                 61m (x2 over 62m)     kubelet            Started container netty
        Warning  Unhealthy               8m1s (x103 over 62m)  kubelet            Readiness probe failed: Get "http://10.130.2.171:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)                              
        Warning  BackOff                 3m2s (x215 over 58m)  kubelet            Back-off restarting failed container netty in pod nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379(75905e1a-8d46-4785-ae8f-07927c6d7571) 
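In the pod listing at the top of these traces, the five crash-looping pods sit on only two nodes (m42-h15-000-r760 and m42-h19-000-r760). A small helper like the one below can tally this from `oc get pod -A -o json` rather than by reading grep output. The function names are hypothetical; the JSON fields used (`spec.nodeName`, `status.containerStatuses[].state.waiting.reason`) are standard Pod fields.

```python
import json
import subprocess
from collections import Counter


def crashloop_pods_by_node(pod_list: dict) -> Counter:
    """Count pods with a container waiting in CrashLoopBackOff, per node."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        crashing = any(
            (s.get("state", {}).get("waiting") or {}).get("reason") == "CrashLoopBackOff"
            for s in statuses
        )
        if crashing:
            node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
            counts[node] += 1
    return counts


def fetch_all_pods() -> dict:
    """Fetch every pod in the cluster as parsed JSON (requires a logged-in oc)."""
    result = subprocess.run(
        ["oc", "get", "pod", "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `crashloop_pods_by_node(fetch_all_pods()).most_common()` lists the most affected nodes first, which helps when deciding which node's ovnkube-node pod to inspect.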

       

      When the ovnkube-node pod running on the node of one of the failing pods is restarted, the pods on that node eventually manage to start, as demonstrated below:

       

      # One of the pods is in CrashLoopBackOff state
      [root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
      NAME                       READY   STATUS             RESTARTS       AGE    IP             NODE               NOMINATED NODE   READINESS GATES                                                                                               
      nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (41s ago)   135m   10.130.2.228   m42-h15-000-r760   <none>           <none>            
      # Restart ovnkube-node pod running in that node
      [root@m42-h01-000-r760 network_logs]# oc get pod -o wide | grep m42-h15-000-r760                                                                                                                                                             
      ovnkube-node-x6lvh                       8/8     Running   55 (140m ago)   26h   192.168.216.16   m42-h15-000-r760   <none>           <none>                                                                                                 
      [root@m42-h01-000-r760 network_logs]# oc delete pod ovnkube-node-x6lvh                                                                                                                                                                       
      pod "ovnkube-node-x6lvh" deleted                                                                                                                                                                                                             
      .
      .
      .
      # The pod eventually starts once the back-off period (5 minutes) has elapsed and the liveness probe succeeds
      [root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
      NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
      nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m19s ago)   138m   10.130.2.228   m42-h15-000-r760   <none>           <none>
      [root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
      NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
      nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m21s ago)   138m   10.130.2.228   m42-h15-000-r760   <none>           <none>
      [root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
      NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
      nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m34s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
      [root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
      NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
      nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m54s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
      [root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide -w
      NAME                       READY   STATUS             RESTARTS        AGE    IP             NODE               NOMINATED NODE   READINESS GATES
      nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (5m3s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
      nginx-1-58d54644f9-b42m2   0/1     Running            46 (5m6s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
      nginx-1-58d54644f9-b42m2   1/1     Running            46 (5m17s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none> 
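The roughly five-minute wait in the trace above is consistent with the documented CrashLoopBackOff behaviour: the restart delay starts at 10 seconds, doubles after each restart, and is capped at five minutes. The sketch below only illustrates that arithmetic; the function name is hypothetical and the kubelet's actual implementation differs in details (for example, the back-off resets after a container has run cleanly for a while).

```python
def crashloop_backoff_seconds(restart_count: int, base: int = 10, cap: int = 300) -> int:
    """Approximate restart delay after `restart_count` restarts.

    Assumes the documented schedule: 10s, 20s, 40s, ... capped at 300s.
    """
    if restart_count <= 0:
        return 0
    return min(base * 2 ** (restart_count - 1), cap)
```

With 45 restarts, as in the trace, the delay has long since reached the 300-second cap, so a pod whose network has been repaired can still take up to five minutes to be restarted and become Ready.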

       

        Attachments:
        1. nginx.yml (2 kB)
        2. openshift-ovn-kubernetes.tgz (93.45 MB)
        3. ovn-databases.tgz (2.66 MB)
        4. ovnkube-node-w25sv.tar.xz (93.81 MB)
        5. ovnkube-node-w25sv.tar-1.xz (93.81 MB)

              anusaxen Anurag Saxena
              rsevilla@redhat.com Raul Sevilla Canavate