OpenShift Bugs / OCPBUGS-62853

secondary-scheduler pod gets OOMKilled on large clusters


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.16
    • Quality / Stability / Reliability
    • Severity: Important

      Description of problem:

          secondary-scheduler pod gets OOMKilled on large clusters:
      
      $ oc get po secondary-scheduler-7d9bf885fc-xqn2z -o yaml
      [...]
        containerStatuses:
        - containerID: cri-o://e2675fbeef0f7f3993bbb2ca63d815d19538434bc2c80e61e2c49d12ba894b0a
          image: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
          imageID: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
          lastState:
            terminated:
              containerID: cri-o://87481f8a582a7360d4b370975f397558499c185a5b00d4c34ded03ed1f702539
              exitCode: 137
              finishedAt: "2025-10-08T05:19:02Z"
              reason: OOMKilled
              startedAt: "2025-10-08T05:18:56Z"
          name: secondary-scheduler
          ready: true
          restartCount: 262
          started: true
          state:
            running:
              startedAt: "2025-10-08T05:19:44Z"
      
      $ oc get deploy -n openshift-numaresources secondary-scheduler -o jsonpath='{.spec.template.spec.containers[].resources}' | jq
      {
        "limits": {
          "cpu": "600m",
          "memory": "1200Mi"
        },
        "requests": {
          "cpu": "600m",
          "memory": "1200Mi"
        }
      }
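
          For reference, the scheduler's actual memory consumption can be compared against the 1200Mi limit (assuming cluster metrics are available); this is only a diagnostic sketch, using the pod name shown above:
      
      $ oc adm top pod -n openshift-numaresources
      $ oc describe pod -n openshift-numaresources secondary-scheduler-7d9bf885fc-xqn2z | grep -A 6 'Last State'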

      Version-Release number of selected component (if applicable):

          numaresources-operator.v4.16.3
      

      How reproducible:

          Observable on a large cluster; the affected cluster has 90 nodes: 3 master + 18 storage + 69 worker.

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      secondary-scheduler pod gets OOMKilled

      Expected results:

      The secondary-scheduler pod should be stable and should not get OOMKilled.

      Additional info:

          It would be a good idea to introduce the ability to tune the resource limits and requests of this pod, perhaps via the NUMAResourcesScheduler CR.
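
          As a rough sketch of that suggestion, a tunable resources stanza in the NUMAResourcesScheduler CR might look like the following; the schedulerResources field and its values are hypothetical (the current CRD does not expose resource tuning), and editing the secondary-scheduler Deployment directly is likely to be reconciled back by the operator:
      
      apiVersion: nodetopology.openshift.io/v1
      kind: NUMAResourcesScheduler
      metadata:
        name: numaresourcesscheduler
      spec:
        imageSpec: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
        # schedulerResources is a hypothetical field proposed by this bug, not part of the current API;
        # the values below are illustrative only
        schedulerResources:
          requests:
            cpu: "600m"
            memory: "2Gi"
          limits:
            cpu: "600m"
            memory: "2Gi"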
      

              fromani@redhat.com Francesco Romani
              fcristin1@redhat.com Francesco Cristini
              Mallapadi Niranjan