OpenShift Bugs / OCPBUGS-1461

Kubelet slowly leaking memory and pods eventually unable to start


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version/s: 4.8.z
    • Component/s: Node / Kubelet
    • Severity: Important



      +++ This bug was initially created as a clone of Bug #2106414 +++

      +++ This bug was initially created as a clone of Bug #2065749 +++

      Description of problem:

      Over time the kubelet slowly consumes memory until, at some point, pods are no longer able to start on the node; coinciding with this are container runtime errors. It appears that even rebooting the node does not resolve the issue once it occurs - the node has to be completely rebuilt.

      How reproducible: Consistently

      Actual results: Pods are eventually unable to start on the node; rebuilding the node is the only workaround

      Expected results: kubelet/crio would continue working as expected

      — Additional comment from lstanton@redhat.com on 2022-03-18 17:12:39 UTC —


      • Ford is having an ongoing issue where they are seeing node memory usage slowly increase over time until pods are unable to start, and errors like the following start to show up:

      "Error: relabel failed /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: lsetxattr /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: operation not permitted"

      • Once this happens the node is unusable and has to be rebuilt, as rebooting apparently does not solve the problem.
      • We originally thought this would be resolved by BZ 2042175, which was eventually closed as a duplicate of BZ 2060494. Since BZ 2044438 represented the 4.9 fix (in 4.9.21, errata RHBA-2022:0488), we told the customer to go ahead and upgrade to 4.9.21, but unfortunately this did not resolve the issue for them, hence this BZ.
      • Ford has a very specific use case for OCP: they are mostly using the cluster to run large numbers of Tekton pipelines, and they are using some custom SCCs to get around current limitations in OCP filesystem support.

      — Additional comment from lstanton@redhat.com on 2022-03-18 17:20:39 UTC —

      There is a lot of data available on case 2065749, though this is before they upgraded from 4.8 to 4.9. If there's any specific data that we haven't captured yet but would be helpful please let me know.

      — Additional comment from lstanton@redhat.com on 2022-03-18 19:08:44 UTC —

      The latest data from an occurrence before the customer upgraded to 4.9.21 can be found here (so this is from 4.8):

      — sosreport —

      — Other data (including pprof) —

      Would it be possible to get your opinion on the above node data? Does anything look obviously out of line in terms of kubelet or CRI-O behavior that might explain pods failing to start? I've requested newer data in the case.

      — Additional comment from rphillips@redhat.com on 2022-03-28 14:47:34 UTC —

        1. Kubelet pprof

        go tool pprof pd103-7h7tj-worker-c-2qh7x-profile.pprof.gz
        (pprof) top
        560ms 17.78% 17.78% 570ms 18.10% syscall.Syscall
        510ms 16.19% 33.97% 520ms 16.51% syscall.Syscall6
        190ms 6.03% 40.00% 190ms 6.03% runtime.epollctl
        130ms 4.13% 44.13% 270ms 8.57% runtime.scanobject
        110ms 3.49% 47.62% 350ms 11.11% runtime.mallocgc
        110ms 3.49% 51.11% 110ms 3.49% runtime.markBits.isMarked (inline)
        80ms 2.54% 53.65% 80ms 2.54% aeshashbody
        80ms 2.54% 56.19% 80ms 2.54% runtime.epollwait
        70ms 2.22% 58.41% 70ms 2.22% runtime.futex
        60ms 1.90% 60.32% 70ms 2.22% runtime.heapBitsSetType

      This dump shows the kubelet is in a state of constant GC. scanobject taking 270ms is high, and Syscall6 (LSTAT) taking 520ms is also high.

        2. Kubelet

      I noticed that pods are failing to be fully cleaned up and created within the kubelet, with the error "Failed to remove cgroup (will retry)".

      This means the pods are staying within the kubelet's memory and the kubelet is retrying the cleanup operation, effectively leaking memory.

      Additionally, per Comment #1, pods are failing during the start phase with the SELinux relabel error. The failure in the start phase leads me to believe we are hitting (1) (not backported yet into 4.9). PR 107845 fixes an issue where pods hitting a CRI error are arbitrarily marked as terminated when they should be marked as waiting.

      (1) https://github.com/kubernetes/kubernetes/pull/107845

      — Additional comment from rphillips@redhat.com on 2022-03-28 15:40:19 UTC —

      What are the custom SCCs? Fixing the issue of the relabel is the first step in solving this.

      — Additional comment from rphillips@redhat.com on 2022-03-29 14:31:26 UTC —

      Upstream leak https://github.com/kubernetes/kubernetes/pull/109103. The pprof on this BZ shows the same issue.

      I'll attach the screenshot.

      — Additional comment from rphillips@redhat.com on 2022-03-29 14:32:06 UTC —

      Created attachment 1869023
      leak pprof

      — Additional comment from lstanton@redhat.com on 2022-04-05 15:02:09 UTC —

      @rphillips@redhat.com do you still need info from me?

      — Additional comment from rphillips@redhat.com on 2022-04-13 14:08:15 UTC —

      No thank you!

      — Additional comment from hshukla@redhat.com on 2022-04-25 09:37:55 UTC —

      Created attachment 1874784
      pprof from 3 nodes for Swisscom case 03125278

      — Additional comment from rphillips@redhat.com on 2022-04-25 13:50:14 UTC —

      @Hradayesh: the customer has the same leak as this BZ.

      ❯ go tool pprof w0-d-altais.corproot.net-heap.pprof
      File: kubelet
      Build ID: 6eb513a78ba65574e291855722d4efa0a3fc9c23
      Type: inuse_space
      Time: Apr 25, 2022 at 3:49am (CDT)
      Entering interactive mode (type "help" for commands, "o" for options)
      (pprof) top
      Showing nodes accounting for 54.37MB, 75.57% of 71.94MB total
      Showing top 10 nodes out of 313
      flat flat% sum% cum cum%
      26.56MB 36.92% 36.92% 26.56MB 36.92% k8s.io/kubernetes/pkg/kubelet/cm/containermap.ContainerMap.Add
      10.50MB 14.60% 51.52% 10.50MB 14.60% k8s.io/kubernetes/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2.(*CreateContainerResponse).Unmarshal
      5.05MB 7.02% 58.53% 5.05MB 7.02% k8s.io/kubernetes/pkg/kubelet/volumemanager/populator.(*desiredStateOfWorldPopulator).markPodProcessed
      3.50MB 4.87% 63.40% 4MB 5.56% k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
      2.50MB 3.48% 66.88% 4.01MB 5.57% k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/libcontainer.newContainerStats
      1.50MB 2.09% 68.97% 2MB 2.78% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Pod).DeepCopy
      1.50MB 2.09% 71.05% 1.50MB 2.09% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Container).Unmarshal
      1.13MB 1.57% 72.62% 1.13MB 1.57% k8s.io/kubernetes/vendor/google.golang.org/protobuf/internal/strs.(*Builder).MakeString
      1.12MB 1.55% 74.18% 1.62MB 2.25% k8s.io/kubernetes/vendor/github.com/golang/groupcache/lru.(*Cache).Add
      1MB 1.39% 75.57% 1MB 1.39% k8s.io/kubernetes/vendor/github.com/aws/aws-sdk-go/aws/endpoints.init

      — Additional comment from bugzilla@redhat.com on 2022-05-09 08:31:37 UTC —

      Account disabled by LDAP Audit for extended failure

      — Additional comment from bsmitley@redhat.com on 2022-05-24 15:17:47 UTC —

      Is there any ETA on getting a fix for this? This issue is happening a lot on Ford's Tekton pods.

      — Additional comment from dahernan@redhat.com on 2022-06-02 14:35:33 UTC —

      Once this is verified, is it viable to also backport it to 4.8, 4.9, or at least 4.10? I do not see any cherry-pick or backport for other versions (yet), but we have a customer (Swisscom) pushing for that, as this is impacting them.

      — Additional comment from aos-team-art-private@redhat.com on 2022-06-09 06:10:29 UTC —

      Elliott changed bug status from MODIFIED to ON_QA.
      This bug is expected to ship in the next 4.11 release.

      — Additional comment from openshift-bugzilla-robot@redhat.com on 2022-06-13 13:13:11 UTC —

      Bugfix included in accepted release 4.11.0-0.nightly-2022-06-11-054027
      Bug will not be automatically moved to VERIFIED for the following reasons:

      • PR openshift/kubernetes#1229 not approved by QA contact

      This bug must now be manually moved to VERIFIED by schoudha@redhat.com

      — Additional comment from schoudha@redhat.com on 2022-06-15 15:21:28 UTC —

      Checked on 4.11.0-0.nightly-2022-06-14-172335 by running pods for over a day; I don't see unexpectedly high memory usage by the kubelet on the node.

      % oc get clusterversion
      version 4.11.0-0.nightly-2022-06-14-172335 True False 8h Cluster version is 4.11.0-0.nightly-2022-06-14-172335

      — Additional comment from errata-xmlrpc@redhat.com on 2022-06-15 17:57:08 UTC —

      This bug has been added to advisory RHEA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

      — Additional comment from bsmitley@redhat.com on 2022-06-27 20:34:28 UTC —

      Do we have a timeframe for this into 4.10?

      This issue is impacting Ford.

      Account name: Ford
      Account Number: 5561914
      TAM customer: yes
      CSE customer: yes
      Strategic: yes

      — Additional comment from bsmitley@redhat.com on 2022-06-28 15:28:55 UTC —

      Ford brought this up in my TAM meeting as a hot issue.

      I need an ETA of when this will be in 4.10. This just needs to be a rough date. I know normally we don't share exact dates because timeframes could change.

            Assignee: Ryan Phillips (rphillip@redhat.com)
            QA Contact: Sunil Choudhary