
Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14.z
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:Scale

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
SDLC stage when should've been found:
None

While running a load test on a bare-metal cluster, some pods got stuck in CrashLoopBackOff state because some of their liveness and readiness probes failed.

The pods in question are simple HTTP servers (nginx), and the probes are typical httpGet probes pointing to the endpoint /.

The load test is executed under the following conditions: a 4.14.27 cluster with 6 nodes (3 workers and 3 masters), maxPods configured to 500, and OVNKubernetes in its IC (interconnect) mode.

The benchmark fills the worker nodes with pods and then runs a pod delete/create cycle. In the local environment I used to reproduce the case, the sequence was as follows:

1439 Deployments (quay.io/cloud-bulldozer/nginx:latest) were created, along with Services pointing to port 8080 of their pods.
The pods of these Deployments are then deleted using `oc delete pod -A -l kube-burner=perf-tests`.
The script waits for them to be up and running again, but this operation gets stuck after some cycles (sometimes during the first one).
Some pods fail to start because their probes fail due to network failures.
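
For anyone trying to reproduce this without the full benchmark, the delete/wait cycle above can be approximated with a minimal sketch like the one below. Only the label selector comes from this report; the function names, the `TIMEOUT` default, and the use of `oc wait` to detect readiness are assumptions, and the real benchmark may wait differently.

```shell
# Sketch of the delete/recreate cycle described above.
# LABEL is taken from the report; TIMEOUT is an assumed value.
LABEL="${LABEL:-kube-burner=perf-tests}"
TIMEOUT="${TIMEOUT:-10m}"

churn_cycle() {
  # Delete every benchmark pod; the Deployments recreate them.
  oc delete pod -A -l "$LABEL"
  # Block until all replacement pods report Ready, or fail after TIMEOUT.
  oc wait pod -A -l "$LABEL" --for=condition=Ready --timeout="$TIMEOUT"
}

run_cycles() {
  # Run the requested number of cycles, stopping at the first one that hangs.
  local cycles="$1" i
  for ((i = 1; i <= cycles; i++)); do
    echo "cycle $i"
    churn_cycle || { echo "cycle $i: pods not Ready within $TIMEOUT" >&2; return 1; }
  done
}
```

Calling `run_cycles 5` would run up to five cycles. When the issue reproduces, the `oc wait` step is expected to time out while the affected pods sit in CrashLoopBackOff.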

This issue is impacting one of our customers, more info at https://access.redhat.com/support/cases/#/case/03868814

Attaching some traces below:

# Some pods are in CrashLoopBackOff state
[root@m42-h01-000-r760 rsevilla]# oc get pod -A -o wide | grep -i crash
ichp-kubelet-density-1258                          nginx-1-58d54644f9-b42m2                                     0/1     CrashLoopBackOff   23 (57s ago)   64m     10.130.2.228     m42-h15-000-r760   <none>           <none>
ichp-kubelet-density-379                           nginx-1-58d54644f9-b5l5t                                     0/1     CrashLoopBackOff   23 (21s ago)   63m     10.130.2.171     m42-h15-000-r760   <none>           <none>
ichp-kubelet-density-43                            nginx-1-58d54644f9-8wsnn                                     0/1     CrashLoopBackOff   23 (29s ago)   63m     10.130.2.197     m42-h15-000-r760   <none>           <none>
ichp-kubelet-density-748                           nginx-1-58d54644f9-l6fl7                                     0/1     CrashLoopBackOff   23 (33s ago)   63m     10.128.2.66      m42-h19-000-r760   <none>           <none>
ichp-kubelet-density-870                           nginx-1-58d54644f9-j29ln                                     0/1     CrashLoopBackOff   23 (51s ago)   63m     10.128.2.183     m42-h19-000-r760   <none>           <none>
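
To see whether the failures cluster on particular nodes, the listing above can be summarized per node. This is a small sketch: `crashloop_by_node` is a hypothetical helper name, and it assumes the `-A -o wide` column layout shown above, where STATUS is the fourth field and NODE is the third field from the end.

```shell
# Count CrashLoopBackOff pods per node.
# Reads `oc get pod -A -o wide` output on stdin.
crashloop_by_node() {
  # $4 is STATUS; the last two columns (NOMINATED NODE, READINESS GATES)
  # are single tokens, so the node name is the third field from the end.
  awk '$4 == "CrashLoopBackOff" { n[$(NF-2)]++ }
       END { for (node in n) print n[node], node }' | sort -rn
}
```

Usage: `oc get pod -A -o wide | crashloop_by_node`. For the five pods listed above this gives 3 for m42-h15-000-r760 and 2 for m42-h19-000-r760.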


# Events from one of the pods

# oc describe pod -n ichp-kubelet-density-379 nginx-1-58d54644f9-b5l5t
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               64m                   default-scheduler  Successfully assigned ichp-kubelet-density-379/nginx-1-58d54644f9-b5l5t to m42-h15-000-r760                                                                      
  Warning  FailedCreatePodSandBox  63m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379_75905e1a-8d46-4785-ae8f-07927c6d7571_0(42fd1c8f921528b7316a349cea255d8d03df00eacd742d3343c79b28fd366324): error adding pod ichp-kubelet-density-379_nginx-1-58d54644f9-b5l5t to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF
  Warning  FailedCreatePodSandBox  63m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379_75905e1a-8d46-4785-ae8f-07927c6d7571_0(4aca9932d378a4c0515722a25f19cc5be2ff1b241daf07c612ec127c05d1324b): error adding pod ichp-kubelet-density-379_nginx-1-58d54644f9-b5l5t to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/socket/multus.sock: connect: no such file or directory                                                               
  Normal   AddedInterface          63m                   multus             Add eth0 [10.130.2.171/23] from ovn-kubernetes                                                                                                                   
  Warning  Unhealthy               61m (x3 over 62m)     kubelet            Readiness probe failed: Get "http://10.130.2.171:8080/": dial tcp 10.130.2.171:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)                
  Warning  Unhealthy               61m (x6 over 62m)     kubelet            Liveness probe failed: Get "http://10.130.2.171:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)                               
  Normal   Killing                 61m                   kubelet            Container netty failed liveness probe, will be restarted                                                                                                         
  Normal   Pulled                  61m (x2 over 62m)     kubelet            Container image "quay.io/cloud-bulldozer/nginx:latest" already present on machine                                                                                
  Normal   Created                 61m (x2 over 62m)     kubelet            Created container netty
  Normal   Started                 61m (x2 over 62m)     kubelet            Started container netty
  Warning  Unhealthy               8m1s (x103 over 62m)  kubelet            Readiness probe failed: Get "http://10.130.2.171:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)                              
  Warning  BackOff                 3m2s (x215 over 58m)  kubelet            Back-off restarting failed container netty in pod nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379(75905e1a-8d46-4785-ae8f-07927c6d7571)

When the ovnkube-node pod running on the node of one of the failing pods is restarted, the pods on that node eventually manage to start, as demonstrated below:

# One of the pods is in CrashLoopBackOff state
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME                       READY   STATUS             RESTARTS       AGE    IP             NODE               NOMINATED NODE   READINESS GATES                                                                                               
nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (41s ago)   135m   10.130.2.228   m42-h15-000-r760   <none>           <none>            
# Restart ovnkube-node pod running in that node
[root@m42-h01-000-r760 network_logs]# oc get pod -o wide | grep m42-h15-000-r760                                                                                                                                                             
ovnkube-node-x6lvh                       8/8     Running   55 (140m ago)   26h   192.168.216.16   m42-h15-000-r760   <none>           <none>                                                                                                 
[root@m42-h01-000-r760 network_logs]# oc delete pod ovnkube-node-x6lvh                                                                                                                                                                       
pod "ovnkube-node-x6lvh" deleted                                                                                                                                                                                                             
.
.
.
# Pod eventually starts once the back-off period (5 minutes) has elapsed and the liveness probe passes
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m19s ago)   138m   10.130.2.228   m42-h15-000-r760   <none>           <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m21s ago)   138m   10.130.2.228   m42-h15-000-r760   <none>           <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m34s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME                       READY   STATUS             RESTARTS         AGE    IP             NODE               NOMINATED NODE   READINESS GATES
nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (4m54s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide -w
NAME                       READY   STATUS             RESTARTS        AGE    IP             NODE               NOMINATED NODE   READINESS GATES
nginx-1-58d54644f9-b42m2   0/1     CrashLoopBackOff   45 (5m3s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
nginx-1-58d54644f9-b42m2   0/1     Running            46 (5m6s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
nginx-1-58d54644f9-b42m2   1/1     Running            46 (5m17s ago)   139m   10.130.2.228   m42-h15-000-r760   <none>           <none>
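
The manual restart shown above can be wrapped in a small helper when several nodes need the same treatment. This is a sketch of the workaround only, not a fix. The `openshift-ovn-kubernetes` namespace and the `app=ovnkube-node` label are assumptions about how the ovnkube-node pods are deployed, and `restart_ovnkube_node` is a hypothetical name.

```shell
# Delete the ovnkube-node pod scheduled on a given node so that its
# DaemonSet recreates it. Namespace and label selector are assumptions.
restart_ovnkube_node() {
  local node="$1" ns="${OVN_NS:-openshift-ovn-kubernetes}" pod
  # Resolve the pod by node name; -o name prints it as pod/<name>.
  pod="$(oc get pod -n "$ns" -l app=ovnkube-node \
          --field-selector "spec.nodeName=$node" -o name)"
  [ -n "$pod" ] || { echo "no ovnkube-node pod found on $node" >&2; return 1; }
  oc delete -n "$ns" "$pod"
}
```

Usage: `restart_ovnkube_node m42-h15-000-r760`. As in the trace above, affected pods would still need their current back-off period to expire before they restart.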


Attachments:
- nginx.yml (2 kB, 2024/08/02 9:21 AM)
- openshift-ovn-kubernetes.tgz (93.45 MB, 2024/08/01 3:50 PM)
- ovn-databases.tgz (2.66 MB, 2024/08/01 3:24 PM)
- ovnkube-node-w25sv.tar.xz (93.81 MB, 2024/09/03 11:04 AM)
- ovnkube-node-w25sv.tar-1.xz (93.81 MB, 2024/09/03 11:04 AM)
