OpenShift Bugs / OCPBUGS-1461

Kubelet slowly leaking memory and pods eventually unable to start


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version/s: 4.8.z
    • Component/s: Node / Kubelet
    • Severity: Important



      +++ This bug was initially created as a clone of Bug #2106414 +++

      +++ This bug was initially created as a clone of Bug #2065749 +++

      Description of problem:

      Over time the kubelet slowly consumes memory until, at some point, pods are no longer able to start on the node; coinciding with this are container runtime errors. It appears that even rebooting the node does not resolve the issue once it occurs - the node has to be completely rebuilt.

      How reproducible: Consistently

      Actual results: Pods are eventually unable to start on the node; rebuilding the node is the only workaround

      Expected results: kubelet/crio would continue working as expected

      — Additional comment from lstanton@redhat.com on 2022-03-18 17:12:39 UTC —


      • Ford is having an ongoing issue where they are seeing node memory usage slowly increase over time until pods are unable to start, and errors like the following start to show up:

      "Error: relabel failed /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: lsetxattr /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: operation not permitted"

      • Once this happens the node is unusable and has to be rebuilt, as rebooting apparently does not solve the problem.
      • We originally thought this would be resolved by BZ 2042175, which was eventually closed as a duplicate of BZ 2060494. Since BZ 2044438 represented the 4.9 fix (in 4.9.21, errata RHBA-2022:0488), we told the customer to go ahead and upgrade to 4.9.21, but unfortunately this did not resolve the issue for them, hence this BZ.
      • Ford has a very specific use case for OCP: they are mostly using the cluster to run large numbers of Tekton pipelines, and they are using some custom SCCs to get around current limitations in OCP filesystem support.

      — Additional comment from lstanton@redhat.com on 2022-03-18 17:20:39 UTC —

      There is a lot of data available on case 2065749, though this is before they upgraded from 4.8 to 4.9. If there's any specific data that we haven't captured yet but would be helpful please let me know.

      — Additional comment from lstanton@redhat.com on 2022-03-18 19:08:44 UTC —

      The latest data from an occurrence before the customer upgraded to 4.9.21 can be found here (so this is from 4.8):

      — sosreport —

      — Other data (including pprof) —

      Would it be possible to get your opinion on the above node data? Does anything look obviously out of line in terms of kubelet or CRI-O behavior that might explain pods failing to start? I've requested newer data in the case.

      — Additional comment from rphillips@redhat.com on 2022-03-28 14:47:34 UTC —

        1. Kubelet pprof

        go tool pprof pd103-7h7tj-worker-c-2qh7x-profile.pprof.gz
        (pprof) top
        560ms 17.78% 17.78% 570ms 18.10% syscall.Syscall
        510ms 16.19% 33.97% 520ms 16.51% syscall.Syscall6
        190ms 6.03% 40.00% 190ms 6.03% runtime.epollctl
        130ms 4.13% 44.13% 270ms 8.57% runtime.scanobject
        110ms 3.49% 47.62% 350ms 11.11% runtime.mallocgc
        110ms 3.49% 51.11% 110ms 3.49% runtime.markBits.isMarked (inline)
        80ms 2.54% 53.65% 80ms 2.54% aeshashbody
        80ms 2.54% 56.19% 80ms 2.54% runtime.epollwait
        70ms 2.22% 58.41% 70ms 2.22% runtime.futex
        60ms 1.90% 60.32% 70ms 2.22% runtime.heapBitsSetType

      This dump shows the kubelet is in a state of constant GC. scanobject taking 270ms is high, and Syscall6 (LSTAT) taking 520ms is also high.

        2. Kubelet

      I noticed that pods are failing to be fully cleaned up and created within the kubelet, with the error "Failed to remove cgroup (will retry)".

      This means the pods are staying within the kubelet's memory and the kubelet is retrying the cleanup operation, effectively leaking memory.

      Additionally, per Comment #1, pods are failing during the start phase with the SELinux relabel error. The failure in the start phase leads me to believe we are hitting (1) (not backported yet into 4.9). PR 107845 fixes an issue where pods hitting a CRI error are arbitrarily marked as terminated when they should be marked as waiting.

      (1) https://github.com/kubernetes/kubernetes/pull/107845

      — Additional comment from rphillips@redhat.com on 2022-03-28 15:40:19 UTC —

      What are the custom SCCs? Fixing the issue of the relabel is the first step in solving this.

      — Additional comment from rphillips@redhat.com on 2022-03-29 14:31:26 UTC —

      Upstream leak https://github.com/kubernetes/kubernetes/pull/109103. The pprof on this BZ shows the same issue.

      I'll attach the screenshot.

      — Additional comment from rphillips@redhat.com on 2022-03-29 14:32:06 UTC —

      Created attachment 1869023
      leak pprof

      — Additional comment from lstanton@redhat.com on 2022-04-05 15:02:09 UTC —

      @rphillips@redhat.com do you still need info from me?

      — Additional comment from rphillips@redhat.com on 2022-04-13 14:08:15 UTC —

      No thank you!

      — Additional comment from hshukla@redhat.com on 2022-04-25 09:37:55 UTC —

      Created attachment 1874784
      pprof from 3 nodes for Swisscom case 03125278

      — Additional comment from rphillips@redhat.com on 2022-04-25 13:50:14 UTC —

      @Hradayesh: the customer has the same leak as this BZ.

      ❯ go tool pprof w0-d-altais.corproot.net-heap.pprof
      File: kubelet
      Build ID: 6eb513a78ba65574e291855722d4efa0a3fc9c23
      Type: inuse_space
      Time: Apr 25, 2022 at 3:49am (CDT)
      Entering interactive mode (type "help" for commands, "o" for options)
      (pprof) top
      Showing nodes accounting for 54.37MB, 75.57% of 71.94MB total
      Showing top 10 nodes out of 313
      flat flat% sum% cum cum%
      26.56MB 36.92% 36.92% 26.56MB 36.92% k8s.io/kubernetes/pkg/kubelet/cm/containermap.ContainerMap.Add
      10.50MB 14.60% 51.52% 10.50MB 14.60% k8s.io/kubernetes/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2.(*CreateContainerResponse).Unmarshal
      5.05MB 7.02% 58.53% 5.05MB 7.02% k8s.io/kubernetes/pkg/kubelet/volumemanager/populator.(*desiredStateOfWorldPopulator).markPodProcessed
      3.50MB 4.87% 63.40% 4MB 5.56% k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
      2.50MB 3.48% 66.88% 4.01MB 5.57% k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/libcontainer.newContainerStats
      1.50MB 2.09% 68.97% 2MB 2.78% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Pod).DeepCopy
      1.50MB 2.09% 71.05% 1.50MB 2.09% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Container).Unmarshal
      1.13MB 1.57% 72.62% 1.13MB 1.57% k8s.io/kubernetes/vendor/google.golang.org/protobuf/internal/strs.(*Builder).MakeString
      1.12MB 1.55% 74.18% 1.62MB 2.25% k8s.io/kubernetes/vendor/github.com/golang/groupcache/lru.(*Cache).Add
      1MB 1.39% 75.57% 1MB 1.39% k8s.io/kubernetes/vendor/github.com/aws/aws-sdk-go/aws/endpoints.init

      — Additional comment from bugzilla@redhat.com on 2022-05-09 08:31:37 UTC —

      Account disabled by LDAP Audit for extended failure

      — Additional comment from bsmitley@redhat.com on 2022-05-24 15:17:47 UTC —

      Is there any ETA on getting a fix for this? This issue is happening a lot on Ford's Tekton pods.

      — Additional comment from dahernan@redhat.com on 2022-06-02 14:35:33 UTC —

      Once this is verified, is it viable to also backport it to 4.8, 4.9, or at least 4.10? I do not see any cherry-pick or backport for other versions (yet), but we have a customer (Swisscom) pushing for that, as this is impacting them.

      — Additional comment from aos-team-art-private@redhat.com on 2022-06-09 06:10:29 UTC —

      Elliott changed bug status from MODIFIED to ON_QA.
      This bug is expected to ship in the next 4.11 release.

      — Additional comment from openshift-bugzilla-robot@redhat.com on 2022-06-13 13:13:11 UTC —

      Bugfix included in accepted release 4.11.0-0.nightly-2022-06-11-054027
      Bug will not be automatically moved to VERIFIED for the following reasons:

      • PR openshift/kubernetes#1229 not approved by QA contact

      This bug must now be manually moved to VERIFIED by schoudha@redhat.com

      — Additional comment from schoudha@redhat.com on 2022-06-15 15:21:28 UTC —

      Checked on 4.11.0-0.nightly-2022-06-14-172335 by running pods for over a day; I don't see unexpectedly high memory usage by the kubelet on the node.

      % oc get clusterversion
      version 4.11.0-0.nightly-2022-06-14-172335 True False 8h Cluster version is 4.11.0-0.nightly-2022-06-14-172335

      — Additional comment from errata-xmlrpc@redhat.com on 2022-06-15 17:57:08 UTC —

      This bug has been added to advisory RHEA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

      — Additional comment from bsmitley@redhat.com on 2022-06-27 20:34:28 UTC —

      Do we have a timeframe for this into 4.10?

      This issue is impacting Ford.

      Account name: Ford
      Account Number: 5561914
      TAM customer: yes
      CSE customer: yes
      Strategic: yes

      — Additional comment from bsmitley@redhat.com on 2022-06-28 15:28:55 UTC —

      Ford brought this up in my TAM meeting as a hot issue.

      I need an ETA of when this will be in 4.10. This just needs to be a rough date. I know normally we don't share exact dates because timeframes could change.

            Assignee: Ryan Phillips (rphillip@redhat.com)
            QA Contact: Sunil Choudhary