OpenShift Bugs / OCPBUGS-66238

Bare Metal node under extreme load after a large ~700-pod deployment

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.19.z
    • Component/s: Node / Kubelet
    • Severity: Important
    • Sprint: Node Green Sprint 280, OCP Node Core Sprint 282

      Description of problem:

      I am investigating an odd issue with the kubelet that appears to have been introduced in v4.19 between specific z-streams (v4.19.14 --> v4.19.18). The issue affects only bare-metal nodes, apparently ones with very large capacity (120+ CPUs and a lot of RAM). Whenever the customer deploys ~700 pods simultaneously, the kubelet tries to mount 2500+ secrets/configmaps at the same time, which causes CPU load so high that the node becomes unusable. At first we thought this was a kernel issue, but the kernel collaboration shows that some change was probably introduced that causes a huge number of processes to stay in D state, saturating the CPUs and leaving the node unresponsive. The kernel team's bottom-line comment is the following:

      This shows that the pods together account for 2345 mounts (mostly tmpfs secret/projected volumes), which is a primary factor inducing the shrinker_rwsem contention. With hundreds of pods and their thousands of tmpfs mounts, it is quite natural that shrinker_rwsem becomes a hot contention point. The issue is more likely a workload and scaling problem in the OCP environment, rather than a kernel bug.
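      For anyone triaging similar nodes, below is a minimal diagnostic sketch in Python. It only reads standard procfs files (no OpenShift APIs) and is meant to be run directly on the affected node; the idea is to count the kubelet's tmpfs-backed volume mounts and the processes stuck in uninterruptible (D) sleep that the kernel analysis refers to.

```python
#!/usr/bin/env python3
"""Count the kubelet's tmpfs volume mounts and D-state processes on a node.

Diagnostic sketch only; run it on the affected node. It reads standard
Linux procfs files and does not talk to any OpenShift API.
"""
import glob


def count_kubelet_tmpfs_mounts(mounts_path="/proc/mounts"):
    # tmpfs-backed volumes the kubelet mounts (secret, projected, downward
    # API, memory-backed emptyDir) show up as tmpfs mounts under
    # .../volumes/kubernetes.io~*/ in the pod directories.
    count = 0
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3 and fields[2] == "tmpfs" and "kubernetes.io~" in fields[1]:
                count += 1
    return count


def count_d_state_processes():
    # Processes stuck in uninterruptible sleep ("D") drive up the load
    # average and are the reported symptom of the shrinker_rwsem contention.
    count = 0
    for stat_path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat_path) as f:
                # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
                # spaces, so split after the last ")" to find the state field.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # process exited while we were scanning
        if state == "D":
            count += 1
    return count


if __name__ == "__main__":
    print("kubelet tmpfs volume mounts:", count_kubelet_tmpfs_mounts())
    print("processes in D state:      ", count_d_state_processes())
```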

       

      The kubelet package version that changed between these two z-streams is:

      openshift-kubelet 4.19.0-202509122308.p2.g335be3a.assembly.stream.el9 → 4.19.0-202510101528.p2.gf94ad89.assembly.stream.el9

      Important Notes:

      • We were able to mitigate the issue by downgrading these workers' CoreOS image to the v4.19.14 one with a MachineConfig (see the sketch after this list).
      • Extensive analysis of the issue from the kernel team is in the attached case, as well as vmcores and sosreports from these nodes.
      • The same issue is not seen on the VM nodes that are also part of the cluster. We are not yet sure about their capacity or deployment volume, but I can ask if required.
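      For context on the mitigation above: it amounts to pinning the worker pool back to the older RHCOS image via a MachineConfig. One way such a pin is commonly expressed is an osImageURL override, sketched below; the image pullspec is a placeholder (the real 4.19.14 rhel-coreos digest for the cluster would have to be substituted), and this is intended only as a temporary rollback, not a recommendation.

```python
#!/usr/bin/env python3
"""Emit a MachineConfig that pins the worker pool to an older RHCOS image.

Sketch only: the pullspec below is a placeholder, and the manifest should be
reviewed before applying it with `oc apply -f`. Requires PyYAML.
"""
import yaml

# Placeholder -- substitute the actual 4.19.14 machine-os image digest.
OLD_RHCOS_IMAGE = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<4.19.14-digest>"

machine_config = {
    "apiVersion": "machineconfiguration.openshift.io/v1",
    "kind": "MachineConfig",
    "metadata": {
        "name": "99-worker-pin-rhcos-4-19-14",
        "labels": {"machineconfiguration.openshift.io/role": "worker"},
    },
    "spec": {
        # Overriding osImageURL makes the MCO roll the pool onto this image.
        "osImageURL": OLD_RHCOS_IMAGE,
    },
}

print(yaml.safe_dump(machine_config, sort_keys=False))
```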

      Version-Release number of selected component (if applicable):

      4.19.0-202510101528.p2.gf94ad89.assembly.stream.el9

      How reproducible:

      - Upgrade to v4.19.18.
      - Deploy ~700 pods simultaneously on the node (see the load-generation sketch below).
      - Watch the node load rise until the node becomes completely unresponsive.
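      Below is a minimal load-generation sketch that approximates this pattern for reproduction attempts. It is illustrative only: the namespace, names, replica count, and per-pod secret count are assumptions, it requires the official `kubernetes` Python client and credentials that can create the namespace and Deployment, and the pods should be pinned to the affected bare-metal node (e.g. via a nodeSelector) to concentrate the tmpfs mounts there.

```python
#!/usr/bin/env python3
"""Approximate the reported load: many pods, each mounting several Secrets.

Illustrative sketch only -- namespace, names, replica and secret counts are
made up; tune them toward ~700 pods and a few thousand tmpfs mounts on one
node. Requires the official `kubernetes` Python client and a kubeconfig.
"""
from kubernetes import client, config

NAMESPACE = "kubelet-mount-load"   # hypothetical namespace
SECRETS_PER_POD = 4                # 700 pods x 4 mounts ~= 2800 tmpfs mounts
REPLICAS = 700


def main():
    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=NAMESPACE))
    )

    # A handful of small Secrets that every pod mounts as tmpfs volumes.
    for i in range(SECRETS_PER_POD):
        core.create_namespaced_secret(
            NAMESPACE,
            client.V1Secret(
                metadata=client.V1ObjectMeta(name=f"load-secret-{i}"),
                string_data={"key": "x" * 128},
            ),
        )

    volumes = [
        client.V1Volume(
            name=f"s{i}",
            secret=client.V1SecretVolumeSource(secret_name=f"load-secret-{i}"),
        )
        for i in range(SECRETS_PER_POD)
    ]
    mounts = [
        client.V1VolumeMount(name=f"s{i}", mount_path=f"/etc/load/s{i}")
        for i in range(SECRETS_PER_POD)
    ]

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="kubelet-mount-load"),
        spec=client.V1DeploymentSpec(
            replicas=REPLICAS,
            selector=client.V1LabelSelector(match_labels={"app": "kubelet-mount-load"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "kubelet-mount-load"}),
                spec=client.V1PodSpec(
                    # To pin all pods to the affected bare-metal node, add e.g.:
                    # node_selector={"kubernetes.io/hostname": "<target-node>"},
                    containers=[
                        client.V1Container(
                            name="pause",
                            image="registry.k8s.io/pause:3.9",
                            volume_mounts=mounts,
                        )
                    ],
                    volumes=volumes,
                ),
            ),
        ),
    )
    apps.create_namespaced_deployment(NAMESPACE, deployment)


if __name__ == "__main__":
    main()
```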

      Actual results:

      - The node becomes completely unresponsive    

      Expected results:

      - The node should not become unresponsive    

      Additional info:

          
