Type: Bug
Resolution: Cannot Reproduce
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.10.z
Component/s: Node / CRI-O
Labels:
- crio

Severity: Important
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

Description of problem:

In Support Case 03419687, OpenShift Container Platform worker nodes became "NotReady" due to excessive "kmalloc-32" kernel memory allocations. The root cause was that CRI-O had thousands of threads running:

$ cat process/ps_-elfL | grep crio | wc -l
27168
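Note that "grep crio" counts every line that mentions "crio" anywhere, including in the arguments of other processes. As a rough cross-check, the thread lines could be grouped per process instead. The following is only a sketch: it assumes the file is plain "ps -elfL" output with a header row containing "PID" and "CMD" columns, and the function name is made up for illustration.

```python
from collections import Counter


def threads_per_process(ps_text):
    """Count thread lines per (PID, executable) in `ps -elfL` style output.

    Assumption: the first line is the ps header and contains the column
    names "PID" and "CMD"; every following line describes one thread (LWP).
    """
    lines = ps_text.splitlines()
    if not lines:
        return Counter()
    header = lines[0].split()
    pid_col = header.index("PID")
    cmd_col = header.index("CMD")

    counts = Counter()
    for line in lines[1:]:
        # Split only up to the CMD column so the command line stays intact.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # skip truncated or empty lines
        executable = fields[cmd_col].split()[0]
        counts[(fields[pid_col], executable)] += 1
    return counts


def top_thread_owners(ps_text, limit=5):
    """Return the processes with the most threads, largest first."""
    return threads_per_process(ps_text).most_common(limit)
```

Running top_thread_owners() on the contents of process/ps_-elfL should show whether the threads belong to the crio daemon itself or to other processes whose command line merely contains "crio".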

"crictl" shows a large number of containers (mostly CronJob containers with a very short lifetime) in "Exited" status:

$ cat crio/crictl_ps_-a | awk '{ print $6 }' | sort | uniq -c | sort -n
      1 ATTEMPT
      1 Created
    235 Running
  15930 Exited
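The awk expression above relies on the state being the sixth whitespace-separated field, which only holds when the CREATED column is three words long (for example "2 hours ago"); the "1 ATTEMPT" row is the header line. A slightly more tolerant count could look for the state keyword itself. This is a sketch only: the set of state names below is an assumption about what "crictl ps -a" prints, and the function name is hypothetical.

```python
from collections import Counter

# Assumption: these are the state strings that appear in `crictl ps -a`.
KNOWN_STATES = {"Created", "Running", "Exited", "Unknown"}


def count_container_states(crictl_text):
    """Count containers per state in `crictl ps -a` style output.

    The first token after the CONTAINER and IMAGE columns that matches a
    known state is taken as the state, so a CREATED value such as
    "About an hour ago" does not shift the result. The header line contains
    no known state (the match is case-sensitive) and is therefore ignored.
    """
    counts = Counter()
    for line in crictl_text.splitlines():
        tokens = line.split()
        state = next((t for t in tokens[2:] if t in KNOWN_STATES), None)
        if state is not None:
            counts[state] += 1
    return counts
```

A container whose name happens to equal one of the state words would not confuse the count, because the state column precedes the name column and the first match wins.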

However, although those containers have exited, the threads appear to keep running. Rebooting the OpenShift Container Platform worker node resolves the issue temporarily, but the number of threads then starts to climb again as the CronJob workload is scheduled and exits.

Kernel slab allocations show more than 200 GB (216396 MB, roughly 211 GiB) of "kmalloc-32" being consumed (mostly by CRI-O):

$ cat proc/slabinfo | awk 'NR<3'; cat proc/slabinfo | awk 'NR>2 {print $0 " " $3*$4/1024/1024 " MB"}' | column -t | sort -k17nr | head -20
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-32                         7090848532  7090876032  32      128  1    :  tunables  0  0  0  :  slabdata  55397469  55397469  0  216396      MB
[..]
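The MB column appended by the awk expression is num_objs multiplied by objsize (fields 3 and 4), converted to MiB. The same arithmetic can be sketched as follows, assuming the slabinfo version 2.1 column layout shown above; the function name is illustrative only.

```python
def slab_usage_mb(slabinfo_text):
    """Return {cache_name: MiB} computed as num_objs * objsize.

    Assumes the slabinfo 2.1 layout: name, active_objs, num_objs, objsize, ...
    Header lines ("slabinfo - version" and the "# name" line) are skipped.
    """
    usage = {}
    for line in slabinfo_text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        num_objs = int(fields[2])
        objsize = int(fields[3])
        usage[name] = num_objs * objsize / 1024 / 1024
    return usage


def largest_caches(slabinfo_text, limit=20):
    """Return the caches with the highest usage, largest first."""
    usage = slab_usage_mb(slabinfo_text)
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:limit]
```

For the kmalloc-32 line above, this gives 7090876032 * 32 bytes, which is about 216396 MiB and matches the value printed in the last column.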

Version-Release number of selected component (if applicable):

$ cat crio/crictl_version 
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.23.3-20.rhaos4.10.git89344de.el8
RuntimeApiVersion:  v1alpha2
$ cat ../etc/redhat-release 
Red Hat Enterprise Linux CoreOS release 4.10
$ cat rpmostree/rpm-ostree_status_-v
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ae72ecdcdc3ca093ab624c430370467414255eed2b00815fd921fda6849b1a4
              CustomOrigin: Managed by machine-config-operator
                   Version: 410.84.202211140926-0 (2022-11-14T09:29:01Z)
[..]

How reproducible:

We have not reproduced the issue on our side. The customer has identified at least 4 nodes on different clusters (OpenShift Container Platform 4.10.42) that had the issue.

Steps to Reproduce:

1. Run OpenShift Container Platform 4.10.42
2. Start many containers (>10k) 
3. Use "ps -elfL | grep crio | wc -l" to observe the number of CRI-O threads
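To see whether the thread count keeps growing while containers start and exit, the count could also be sampled over time on a live node. The sketch below reads the "Threads:" field from /proc/<pid>/status; the PID of the crio daemon has to be supplied by the caller, and the sample count and interval are arbitrary assumptions.

```python
import time
from pathlib import Path


def parse_thread_count(status_text):
    """Extract the value of the "Threads:" line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def watch_threads(pid, samples=5, interval=60.0):
    """Yield (unix_timestamp, thread_count) pairs for the given PID.

    Linux only. `pid` is assumed to be the PID of the crio daemon.
    """
    status_file = Path(f"/proc/{pid}/status")
    for i in range(samples):
        yield time.time(), parse_thread_count(status_file.read_text())
        if i < samples - 1:
            time.sleep(interval)
```

A steadily rising count across samples while the CronJob workload runs would be consistent with the behaviour described in this report; a flat count would suggest the node is not affected.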

Actual results:

The number of CRI-O threads increases significantly.

Expected results:

The number of CRI-O threads stays roughly the same, with no significant increase.

Additional info:

Slack thread: https://redhat-internal.slack.com/archives/C02UD0TT3/p1674826603573669

sosreport is available here: https://drive.google.com/drive/folders/1TEyaYERPzuhRH2vr2HybKhsN3nccHDx-

Assignee: Peter Hunt
Reporter: Simon Krenger
QA Contact: Sunil Choudhary
Need Info From: Simon Krenger
Votes: 0
Watchers: 5
Created: 2023/01/27 3:20 PM
Updated: 2023/05/16 6:23 AM
Resolved: 2023/05/16 6:23 AM
