Type: Bug
Resolution: Unresolved
Priority: Normal
Severity: Moderate
Affects Version: 4.9
Category: Quality / Stability / Reliability
Description of problem:
Execution of 500 pods (418 workload pods) fails after some continuous churn when PAO and the RT kernel are installed on an SNO.
Version-Release number of selected component (if applicable):
RT kernel: 4.18.0-305.30.1.rt7.102.el8_4.x86_64
OCP version: 4.9.17
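For reference, these versions can be confirmed from the cluster; a minimal check, assuming cluster-admin access and a placeholder node name, looks like:
$ oc get clusterversion
$ oc debug node/<node-name> -- chroot /host uname -r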
How reproducible:
A workload of 418 pods on an SNO with the RT kernel installed starts failing, with multiple pods stuck in the "CreateContainerError" state ("ContextDeadlineExceeded" reported in pod events).
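The stuck pods and their events can be inspected with standard commands; a rough sketch (namespace and pod name are placeholders) is:
# oc get pods -A | grep CreateContainerError
# oc describe pod <pod-name> -n <namespace>
# oc get events -n <namespace> --sort-by=.lastTimestamp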
Details:
1. Ran 3 iterations of creation and cleanup of 168 workload pods on the cluster without any issues.
2. Next, ran 3 iterations of creation and cleanup of 418 workload pods on the cluster; this failed with a large number of pods stuck in the "CreateContainerError" state ("ContextDeadlineExceeded" reported in pod events).
3. Node memory utilization went over 85%:
[root@nchhabra-baremetal01 nchhabra-baremetal03]# oc adm top node
NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
nchhabra-baremetal03   15715m       17%    98461Mi         88%
The same workload runs without any issues on an SNO with the same OCP version but without PAO and the RT kernel.
Workload pods are seen to consume only 6-7 MiB of memory each, and the node has about 99 GiB of memory available for workload pods (after 16 GiB is allocated to HugePages).
Each successful iteration of the same workload utilizes around 30 GiB of available memory, but when an iteration fails, node memory depletes faster until the node eventually crashes and has to be restored with a reboot.
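The ~99 GiB figure is based on the node's allocatable resources; one way to double-check allocatable memory and the HugePages reservation (node name is a placeholder) is:
# oc describe node <node-name> | grep -A 10 "Allocatable"
# oc get node <node-name> -o jsonpath='{.status.allocatable}'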
top - 16:40:49 up 18:47, 1 user, load average: 848.89, 847.92, 771.18
Tasks: 8100 total, 18 running, 8072 sleeping, 0 stopped, 10 zombie
%Cpu(s): 5.1 us, 93.3 sy, 0.0 ni, 1.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 128580.7 total, 380.3 free, 125627.5 used, 2572.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1671.2 avail Mem
  PID   USER       PR  NI     VIRT     RES  SHR  S   %CPU  %MEM      TIME+  COMMAND
76892   root       20   0  8493728  175624    0  S   2864   0.1  284:42.93  openshift-apise
81384   root       20   0  7355264   51224    0  S  917.5   0.0  116:36.12  oauth-apiserver
70669   nfsnobo+   20   0  5949932    2.8g    0  S  846.1   2.2  260:22.88  prometheus
76519   root       20   0  5499264   30648    0  D  755.0   0.0   76:23.81  coredns
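The top snapshot above was taken close to memory exhaustion; to track the memory trend across iterations, something along these lines can be used (node name is a placeholder):
# watch -n 30 oc adm top node
# oc debug node/<node-name> -- chroot /host free -m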
Steps to Reproduce:
1. Deploy an SNO with PAO and the RT kernel installed (16 GiB allocated to HugePages).
2. Run 3 iterations of creation and cleanup of 168 workload pods (completes without issues).
3. Run 3 iterations of creation and cleanup of 418 workload pods (500 pods total on the node); see the churn-loop sketch below.
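A minimal sketch of the churn loop for steps 2-3, assuming the workload pods come from a single manifest and carry a common label (the file name, label, and namespace below are illustrative, not taken from the actual test):
for i in 1 2 3; do
  # create the workload pods and wait for them to become Ready
  oc apply -f workload-pods.yaml
  oc wait pods -l app=workload -n workload-ns --for=condition=Ready --timeout=30m
  # clean up before the next iteration
  oc delete -f workload-pods.yaml --wait=true
done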
Actual results:
Workload failure with multiple pods stuck in the "CreateContainerError" state. Node memory keeps depleting under the above-mentioned workload until the node crashes.
Expected results:
418 workload pods (500 total pods) should run without memory issues on an SNO with PAO and the RT kernel installed.
Additional info:
The workload was stopped before available memory went all the way down to zero in order to collect journalctl logs. Must-gather logs were also collected after deleting the workload, since must-gather did not complete successfully while the workload was running.
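The logs mentioned above can be gathered roughly as follows (node name is a placeholder; the exact journalctl units and time window may differ from what was actually collected):
# oc adm must-gather
# oc debug node/<node-name> -- chroot /host journalctl -u kubelet -u crio --since "2 hours ago" > sno-journal.log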