Bug
Resolution: Unresolved
Normal
4.14.z
Quality / Stability / Reliability
While running a load test on a bare-metal cluster, some pods get stuck in the CrashLoopBackOff state because their liveness and readiness probes fail.
The pods in question are simple HTTP servers (nginx), and the probes are the usual httpGet probes pointing at the / endpoint.
The load test is executed under the following conditions: a 4.14.27 cluster with 6 nodes (3 workers and 3 masters), maxPods configured to 500, and OVNKubernetes in its interconnect (IC) mode.
The benchmark fills the worker nodes with pods and then runs a pod delete/create cycle. In the local environment I used to reproduce the case:
- 1439 Deployments (quay.io/cloud-bulldozer/nginx:latest) are created, along with Services pointing to port 8080 of their pods
- The pods from these Deployments are then deleted using `oc delete pod -A -l kube-burner=perf-tests` (a minimal sketch of this delete/wait cycle follows the list)
- The script waits for them to be up & running again, but this operation gets stuck after a few cycles (sometimes during the first one)
- Some pods don't manage to start because their probes fail due to network failures
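A minimal sketch of that delete/wait cycle, assuming the benchmark pods carry the kube-burner=perf-tests label shown above (the 10-minute timeout is an arbitrary value for illustration, not the benchmark's actual setting):
# Delete the benchmark pods, then wait for the Deployments to recreate them and report Ready
oc delete pod -A -l kube-burner=perf-tests --wait=false
oc wait pod -A -l kube-burner=perf-tests --for=condition=Ready --timeout=10m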
This issue is impacting one of our customers; more information at https://access.redhat.com/support/cases/#/case/03868814
Attaching some traces below:
# Some pods are in CrashLoopBackOff state
[root@m42-h01-000-r760 rsevilla]# oc get pod -A -o wide | grep -i crash
ichp-kubelet-density-1258   nginx-1-58d54644f9-b42m2   0/1   CrashLoopBackOff   23 (57s ago)   64m   10.130.2.228   m42-h15-000-r760   <none>   <none>
ichp-kubelet-density-379    nginx-1-58d54644f9-b5l5t   0/1   CrashLoopBackOff   23 (21s ago)   63m   10.130.2.171   m42-h15-000-r760   <none>   <none>
ichp-kubelet-density-43     nginx-1-58d54644f9-8wsnn   0/1   CrashLoopBackOff   23 (29s ago)   63m   10.130.2.197   m42-h15-000-r760   <none>   <none>
ichp-kubelet-density-748    nginx-1-58d54644f9-l6fl7   0/1   CrashLoopBackOff   23 (33s ago)   63m   10.128.2.66    m42-h19-000-r760   <none>   <none>
ichp-kubelet-density-870    nginx-1-58d54644f9-j29ln   0/1   CrashLoopBackOff   23 (51s ago)   63m   10.128.2.183   m42-h19-000-r760   <none>   <none>

# Events from one of the pods
# oc describe pod -n ichp-kubelet-density-379 nginx-1-58d54644f9-b5l5t
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               64m                   default-scheduler  Successfully assigned ichp-kubelet-density-379/nginx-1-58d54644f9-b5l5t to m42-h15-000-r760
  Warning  FailedCreatePodSandBox  63m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379_75905e1a-8d46-4785-ae8f-07927c6d7571_0(42fd1c8f921528b7316a349cea255d8d03df00eacd742d3343c79b28fd366324): error adding pod ichp-kubelet-density-379_nginx-1-58d54644f9-b5l5t to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF
  Warning  FailedCreatePodSandBox  63m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379_75905e1a-8d46-4785-ae8f-07927c6d7571_0(4aca9932d378a4c0515722a25f19cc5be2ff1b241daf07c612ec127c05d1324b): error adding pod ichp-kubelet-density-379_nginx-1-58d54644f9-b5l5t to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/socket/multus.sock: connect: no such file or directory
  Normal   AddedInterface          63m                   multus             Add eth0 [10.130.2.171/23] from ovn-kubernetes
  Warning  Unhealthy               61m (x3 over 62m)     kubelet            Readiness probe failed: Get "http://10.130.2.171:8080/": dial tcp 10.130.2.171:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               61m (x6 over 62m)     kubelet            Liveness probe failed: Get "http://10.130.2.171:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing                 61m                   kubelet            Container netty failed liveness probe, will be restarted
  Normal   Pulled                  61m (x2 over 62m)     kubelet            Container image "quay.io/cloud-bulldozer/nginx:latest" already present on machine
  Normal   Created                 61m (x2 over 62m)     kubelet            Created container netty
  Normal   Started                 61m (x2 over 62m)     kubelet            Started container netty
  Warning  Unhealthy               8m1s (x103 over 62m)  kubelet            Readiness probe failed: Get "http://10.130.2.171:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff                 3m2s (x215 over 58m)  kubelet            Back-off restarting failed container netty in pod nginx-1-58d54644f9-b5l5t_ichp-kubelet-density-379(75905e1a-8d46-4785-ae8f-07927c6d7571)
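The probe failures above point at the pod network datapath rather than at nginx itself. A quick way to double-check that (a sketch, assuming the node can reach pod IPs through the OVN-Kubernetes management interface and that curl is available in the debug image) is to run the same httpGet from the node hosting the pod:
# Manually reproduce the readiness probe from the affected node, using the pod IP from the events above
oc debug node/m42-h15-000-r760 -- curl -sS --max-time 5 http://10.130.2.171:8080/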
When the ovnkube-node pod running on the node of one of the failing pods is restarted, the pods on that node eventually manage to start, as demonstrated below:
# One of the pods is in CrashLoopBackOff state
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-1-58d54644f9-b42m2 0/1 CrashLoopBackOff 45 (41s ago) 135m 10.130.2.228 m42-h15-000-r760 <none> <none>
# Restart ovnkube-node pod running in that node
[root@m42-h01-000-r760 network_logs]# oc get pod -o wide | grep m42-h15-000-r760
ovnkube-node-x6lvh 8/8 Running 55 (140m ago) 26h 192.168.216.16 m42-h15-000-r760 <none> <none>
[root@m42-h01-000-r760 network_logs]# oc delete pod ovnkube-node-x6lvh
pod "ovnkube-node-x6lvh" deleted
.
.
.
# Pod eventually manages to start once the back-off period (5 minutes) has elapsed and the liveness probe passes
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-1-58d54644f9-b42m2 0/1 CrashLoopBackOff 45 (4m19s ago) 138m 10.130.2.228 m42-h15-000-r760 <none> <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-1-58d54644f9-b42m2 0/1 CrashLoopBackOff 45 (4m21s ago) 138m 10.130.2.228 m42-h15-000-r760 <none> <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-1-58d54644f9-b42m2 0/1 CrashLoopBackOff 45 (4m34s ago) 139m 10.130.2.228 m42-h15-000-r760 <none> <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-1-58d54644f9-b42m2 0/1 CrashLoopBackOff 45 (4m54s ago) 139m 10.130.2.228 m42-h15-000-r760 <none> <none>
[root@m42-h01-000-r760 network_logs]# oc get pod -n ichp-kubelet-density-1258 -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-1-58d54644f9-b42m2 0/1 CrashLoopBackOff 45 (5m3s ago) 139m 10.130.2.228 m42-h15-000-r760 <none> <none>
nginx-1-58d54644f9-b42m2 0/1 Running 46 (5m6s ago) 139m 10.130.2.228 m42-h15-000-r760 <none> <none>
nginx-1-58d54644f9-b42m2 1/1 Running 46 (5m17s ago) 139m 10.130.2.228 m42-h15-000-r760 <none> <none>
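For reference, the ovnkube-node pod scheduled on the affected node can also be located and deleted in a single step; the namespace and label below are the defaults used by OVN-Kubernetes on OpenShift (adjust if your cluster differs):
# Restart the ovnkube-node pod running on m42-h15-000-r760
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node --field-selector spec.nodeName=m42-h15-000-r760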