Project: OpenShift Bugs
Issue: OCPBUGS-61725

[4.14] kubelet podresources API incorrectly reports memory assignments of terminated pods

    • Type: Bug
    • Priority: Critical
    • Severity: Critical
    • Resolution: Won't Do
    • Affects Version: 4.14.z
    • Component: Node / Kubelet
    • Quality / Stability / Reliability
    • Status: In Progress
    • Release Note Not Required

      This is a clone of issue OCPBUGS-56785. The following is the description of the original issue:

      Description of problem:

      The kubelet podresources endpoint is meant to report exclusive resources allocated to active pods.
      The endpoint incorrectly also returns resources allocated to terminated pods.
      
      Two factors combine to create the bug:
      
      1. The podresources API depends on kubelet internals to retrieve the list of currently active pods. The function it relied on incorrectly returned both active and terminated pods.
      2. If the podresources API incorrectly considers a terminated pod, we run into a second issue, in the memory manager. The memory manager garbage-collects stale resources (assignments to terminated pods) only in the allocation flow. Thus, if no new pod gets admitted, the kubelet, through the podresources API, keeps reporting memory resources as assigned to a terminated pod. This report is bogus because those resources are no longer reserved, but the podresources API cannot know that. This does NOT affect the allocation flow (the first thing it does is clean up stale assignments), but it does affect the reporting, and this behavior is not fixed upstream.
      
      
      Why does this affect only memory?
      
      1. Device assignments are explicitly cleaned up by the podresources API endpoint.
      2. CPU assignments are not (and they should be), but they are automatically cleaned up every cpuManagerReconcilePeriod seconds, so the CPU report recovers on its own.
      
      This breaks NUMA-aware scheduling in an unrecoverable way.
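
      To make the report concrete, here is a minimal Go sketch that dumps the per-container memory assignments returned by the endpoint, assuming the standard podresources v1 gRPC API (k8s.io/kubelet/pkg/apis/podresources/v1) and the kubelet's default socket path /var/lib/kubelet/pod-resources/kubelet.sock; it must run as root on the node:
      ```
      // Dump per-container memory assignments reported by the kubelet
      // podresources endpoint. Minimal sketch; error handling is terse.
      package main

      import (
          "context"
          "fmt"
          "log"
          "time"

          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
      )

      func main() {
          // Default socket path; adjust if the kubelet is configured differently.
          conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
              grpc.WithTransportCredentials(insecure.NewCredentials()))
          if err != nil {
              log.Fatalf("cannot connect to podresources socket: %v", err)
          }
          defer conn.Close()

          client := podresourcesv1.NewPodResourcesListerClient(conn)
          ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
          defer cancel()

          resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
          if err != nil {
              log.Fatalf("List failed: %v", err)
          }

          for _, pod := range resp.GetPodResources() {
              for _, cnt := range pod.GetContainers() {
                  // With the bug, terminated pods can still show up here with memory blocks.
                  for _, mem := range cnt.GetMemory() {
                      fmt.Printf("%s/%s container=%s type=%s size=%d\n",
                          pod.GetNamespace(), pod.GetName(), cnt.GetName(),
                          mem.GetMemoryType(), mem.GetSize_())
                  }
              }
          }
      }
      ```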

      Version-Release number of selected component (if applicable):

      4.18.z (any)
      Actually reproduced in Server Version: 4.18.0-0.nightly-2025-04-13-142946

      How reproducible:

      100%    

      Steps to Reproduce:

          1. Configure the kubelet with memory manager policy = Static.
          2. Run a job whose pods qualify for memory pinning (see the example manifest below).
          3. Query the podresources endpoint on the node. The endpoint is node-local, exposed through a unix domain socket, so it has to be queried programmatically. Probably the simplest option is to download the `knit` tool from https://github.com/openshift-kni/debug-tools/releases/tag/v0.2.1 and run `knit podres` with root privileges; a Go sketch that performs this check programmatically follows the example manifest below.
      
      
      example manifest:
      ```
      apiVersion: batch/v1
      kind: Job
      metadata:
        labels:
          app: idle-gu-job-sched-stall
        generateName: generic-pause-
      spec:
        backoffLimit: 6
        completionMode: NonIndexed
        completions: 2
        manualSelector: false
        parallelism: 2
        podReplacementPolicy: TerminatingOrFailed
        suspend: false
        template:
          metadata:
            labels:
              app: idle-gu-job-sched-stall
          spec:
            containers:
            - args:
              - 1s
              command:
              - /bin/sleep
              image: quay.io/openshift-kni/pause:test-ci
              imagePullPolicy: IfNotPresent
              name: generic-job-idle
              resources:
                limits:
                  cpu: 100m
                  memory: 256Mi
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
            dnsPolicy: ClusterFirst
            restartPolicy: Never
            schedulerName: default-scheduler
            terminationGracePeriodSeconds: 30
            topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  app: idle-gu-job-sched-stall
              matchLabelKeys:
              - pod-template-hash
              maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
      
      ```    
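
      The check in step 3 can also be scripted. Below is a minimal Go sketch that flags memory still reported for pods that are no longer active on the node, assuming client-go access to the cluster in addition to the podresources socket; the NODE_NAME and KUBECONFIG environment variables are illustrative assumptions, not part of the original reproducer:
      ```
      // Flag podresources memory entries that belong to pods no longer active
      // on this node. Sketch only; assumes root on the node plus a kubeconfig
      // with permission to list pods.
      package main

      import (
          "context"
          "fmt"
          "log"
          "os"
          "time"

          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
          podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
      )

      func main() {
          nodeName := os.Getenv("NODE_NAME") // illustrative: the node being checked

          cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
          if err != nil {
              log.Fatal(err)
          }
          cs := kubernetes.NewForConfigOrDie(cfg)

          // Build the set of pods on this node that are still active (not Succeeded/Failed).
          pods, err := cs.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{
              FieldSelector: "spec.nodeName=" + nodeName,
          })
          if err != nil {
              log.Fatal(err)
          }
          active := map[string]bool{}
          for _, p := range pods.Items {
              if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
                  active[p.Namespace+"/"+p.Name] = true
              }
          }

          conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
              grpc.WithTransportCredentials(insecure.NewCredentials()))
          if err != nil {
              log.Fatal(err)
          }
          defer conn.Close()

          ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
          defer cancel()
          resp, err := podresourcesv1.NewPodResourcesListerClient(conn).List(ctx,
              &podresourcesv1.ListPodResourcesRequest{})
          if err != nil {
              log.Fatal(err)
          }

          for _, pod := range resp.GetPodResources() {
              key := pod.GetNamespace() + "/" + pod.GetName()
              for _, cnt := range pod.GetContainers() {
                  // Expected: terminated pods are absent, or carry no memory blocks.
                  if len(cnt.GetMemory()) > 0 && !active[key] {
                      fmt.Printf("STALE: %s container=%s still has %d memory block(s)\n",
                          key, cnt.GetName(), len(cnt.GetMemory()))
                  }
              }
          }
      }
      ```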

      Actual results:

      The kubelet returns memory resources assigned to a terminated pod.

      Expected results:

      Either:
      1. the kubelet does not return the terminated pod, or
      2. the kubelet returns the terminated pod, but without any resources assigned to it.

      Additional info:

      Possibly affects older versions of OpenShift.
      Solved upstream in Kubernetes by the pod workers refactoring: the podresources endpoint (correctly) ignores terminated pods and only lists active pods.

       

              Assignee: Node Team Bot Account (aos-node@redhat.com)
              Reporter: Francesco Romani (fromani@redhat.com)
              QA Contact: Bhargavi Gudi