Bug
Resolution: Cannot Reproduce
Normal
4.9
Quality / Stability / Reliability
Moderate
Unspecified
Rejected
Description of problem:
Occasionally, many pods on a worker node get stuck in 'ContainerCreating' status on OCP 4.9.1.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@bastion1 ~]# oc get po -n oam -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
acpf11067-cip1-8486656df9-mhzns 1/1 Running 0 6h11m fd01:0:0:6::59 worker03.ss2.host.local <none> <none>
aupf11067-cmp1-7548d577c-qqznc 0/1 ContainerCreating 0 5h59m <none> worker02.ss2.host.local <none> <none>
aupf11067-dmp0-856dbcbf9-d6j8d 0/1 ContainerCreating 0 5h19m <none>
..
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When this happens, the node becomes unstable, and sometimes an sosreport cannot be collected.
It happened on one node, and later on another node.
Rebooting the node resolves the issue.
In the journal, many of the following errors appear when the issue occurs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dec 10 01:17:16 worker02.ss2.host.local hyperkube[4111]: I1210 01:17:16.011373 4111 pod_container_manager_linux.go:194] "Failed to delete cgroup paths" cgroupName=[kubepods pod39dcbeba-82ee-42b4-ae41-39dc7cdbad98] err="unable to destroy cgroup paths for cgroup [kubepods pod39dcbeba-82ee-42b4-ae41-39dc7cdbad98] : Timed out while waiting for systemd to remove kubepods-pod39dcbeba_82ee_42b4_ae41_39dc7cdbad98.slice"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
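To see how many distinct pod cgroups are affected, the journal messages above can be summarized per systemd slice. The following is only a sketch: the function name `failed_cgroup_slices` and the log file name are hypothetical, and it assumes the kubelet journal was saved as plain text (for example with `journalctl -u kubelet > kubelet.log` on the node).

```shell
# Summarize "Failed to delete cgroup paths" messages per systemd slice.
# Assumption: $1 is a plain-text copy of the kubelet journal.
failed_cgroup_slices() {
    # Keep only the cgroup-deletion failures, extract the slice name from
    # the end of each message, then count occurrences per slice
    # (most frequent first).
    grep 'Failed to delete cgroup paths' "$1" \
        | grep -o 'kubepods[^ ]*\.slice' \
        | sort | uniq -c | sort -rn
}
```

Calling `failed_cgroup_slices kubelet.log` prints one line per slice with its message count, which shows whether the timeouts come from a few pods or from many.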
It seems there were a large number of kubelet threads (hyperkube, PID 4111) running at the time of occurrence; `ps -elfL` prints one line per thread.
$ grep kubelet ./sos_commands/process/ps_-elfL | grep 4111 | wc -l
7011
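The `grep 4111` above matches that number anywhere in the line (for example in the PPID column or in the command arguments), so the count may be slightly inflated. A stricter count can be sketched as below. The function name `count_threads` is hypothetical, and the sketch assumes the usual `ps -elfL` column layout, in which the PID is the 4th whitespace-separated field.

```shell
# Count the threads (LWPs) that belong to exactly one PID in saved
# `ps -elfL` output.
# Assumption: the PID is the 4th column (F S UID PID PPID LWP ...).
count_threads() {
    pid="$1"
    file="$2"
    # Compare the PID column only, so child processes and command-line
    # arguments that merely contain the number are not counted.
    awk -v pid="$pid" '$4 == pid { n++ } END { print n + 0 }' "$file"
}
```

For the sosreport above, this would be invoked as `count_threads 4111 ./sos_commands/process/ps_-elfL`.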
Version-Release number of selected component (if applicable):
OCP Service Version: 4.9.1
Kubernetes Version: v1.22.0-rc.0+ef241fd
How reproducible:
We do not currently know the conditions that reproduce the issue.
Actual results:
Pods remain stuck in 'ContainerCreating' status.
Expected results:
Pods should be created and become Ready.
Additional info: