OpenShift Bugs / OCPBUGS-9092

SNO with PAO installed and rt-kernel crashes with memory issues while running 500 pods


      Description of problem:

      Running 500 pods (418 of which are workload pods) fails after a few iterations of pod creation and cleanup (churn) when PAO and the rt-kernel are installed on an SNO.
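
      The exact PerformanceProfile used on this node is not attached to the report; purely as an illustration of the described setup (rt-kernel enabled, 16 GiB of HugePages), a profile of this shape is assumed (name, CPU ranges and hugepage size are placeholders, not taken from the cluster), applied with oc apply -f performanceprofile.yaml:

      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        name: sno-rt                     # hypothetical name
      spec:
        cpu:
          isolated: "4-63"               # placeholder CPU ranges
          reserved: "0-3"
        hugepages:
          defaultHugepagesSize: 1G
          pages:
          - size: 1G
            count: 16                    # 16 x 1G = 16 GiB, matching the report
        realTimeKernel:
          enabled: true                  # PAO installs the rt-kernel
        nodeSelector:
          node-role.kubernetes.io/master: ""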

      Version-Release number of selected component (if applicable):

      RT kernel: 4.18.0-305.30.1.rt7.102.el8_4.x86_64

      OCP version: 4.9.17

      How reproducible:
      A workload of 418 pods on the SNO with the rt-kernel installed starts failing, with multiple pods stuck in the "CreateContainerError" state ("ContextDeadlineExceeded" in the pod events).
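
      The affected pods and the underlying events can be located with standard oc commands, for example (pod and namespace names are placeholders):

      oc get pods -A | grep CreateContainerError
      oc describe pod <pod-name> -n <namespace>
      oc get events -n <namespace> --sort-by=.lastTimestamp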

      Details:
      1. Ran 3 iterations of creation and cleanup of 168 workload pods on the cluster without any issues.
      2. Next, ran 3 iterations of creation and cleanup of 418 workload pods on the cluster; this failed with a large number of pods stuck in the "CreateContainerError" state ("ContextDeadlineExceeded" in the pod events). A minimal sketch of such a churn loop follows this list.
      3. Node memory utilization went over 85%:
      [root@nchhabra-baremetal01 nchhabra-baremetal03]# oc adm top node
      NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
      nchhabra-baremetal03   15715m       17%    98461Mi         88%
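
      The actual workload manifests are not part of this report; a minimal sketch of the churn loop, assuming a simple pause-image deployment as a stand-in for the 418 workload pods (namespace and image are illustrative):

      oc new-project churn-test
      for i in 1 2 3; do
        # create the workload pods
        oc create deployment churn --image=registry.k8s.io/pause:3.9 --replicas=418 -n churn-test
        # wait for the rollout, then clean up before the next iteration
        oc rollout status deployment/churn -n churn-test --timeout=30m
        oc delete deployment churn -n churn-test --wait=true
      done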

      The same workload runs without any issues on an SNO with the same OCP version but without PAO and the rt-kernel.

      Workload pods are seen to consume only 6-7 MiB of memory each, and the node has about 99 GiB of memory available for workload pods (after 16 GiB is allocated to HugePages).
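
      Per-pod consumption and the node's allocatable memory can be confirmed with, for example (namespace name is illustrative):

      oc adm top pods -n churn-test
      oc describe node nchhabra-baremetal03 | grep -A 8 Allocatable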

      Each successful iteration of the same workload is seen to utilize around 30 GiB of available memory, but when an iteration fails, node memory depletes faster until the node eventually crashes and has to be restored with a reboot.

      top - 16:40:49 up 18:47, 1 user, load average: 848.89, 847.92, 771.18
      Tasks: 8100 total, 18 running, 8072 sleeping, 0 stopped, 10 zombie
      %Cpu(s): 5.1 us, 93.3 sy, 0.0 ni, 1.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
      MiB Mem : 128580.7 total, 380.3 free, 125627.5 used, 2572.9 buff/cache
      MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1671.2 avail Mem

        PID USER      PR  NI    VIRT    RES  SHR S   %CPU %MEM     TIME+ COMMAND
      76892 root      20   0 8493728 175624    0 S   2864  0.1 284:42.93 openshift-apise
      81384 root      20   0 7355264  51224    0 S  917.5  0.0 116:36.12 oauth-apiserver
      70669 nfsnobo+  20   0 5949932   2.8g    0 S  846.1  2.2 260:22.88 prometheus
      76519 root      20   0 5499264  30648    0 D  755.0  0.0  76:23.81 coredns

      Steps to Reproduce:
      1. Deploy an SNO on OCP 4.9.17 with PAO installed and the rt-kernel enabled (16 GiB allocated to HugePages).
      2. Run 3 iterations of creation and cleanup of 168 workload pods (completes without issues).
      3. Run 3 iterations of creation and cleanup of 418 workload pods (500 total pods on the node).

      Actual results:
      Workload failure with multiple pods stuck in the "CreateContainerError" state. Node memory depletes under the above workload until the node crashes.

      Expected results:
      418 workload pods (500 total pods) should run without memory issues on an SNO with the rt-kernel installed.

      Additional info:

      Stopped the workload before the available memory went all the way down to zero in order to collect journalctl logs. Must-gather logs were also collected after the workload was deleted, since must-gather did not complete while the workload was running.
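
      For reference, the logs can be gathered with the standard tooling (the node name below is the one from this report):

      oc adm must-gather
      oc debug node/nchhabra-baremetal03 -- chroot /host journalctl --no-pager > journal.log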
