Bug
Resolution: Unresolved
Normal
4.16
Quality / Stability / Reliability
False
Important
Description of problem:
secondary-scheduler pod gets OOMKilled on large clusters:

$ oc get po secondary-scheduler-7d9bf885fc-xqn2z -o yaml
[...]
  containerStatuses:
  - containerID: cri-o://e2675fbeef0f7f3993bbb2ca63d815d19538434bc2c80e61e2c49d12ba894b0a
    image: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
    imageID: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
    lastState:
      terminated:
        containerID: cri-o://87481f8a582a7360d4b370975f397558499c185a5b00d4c34ded03ed1f702539
        exitCode: 137
        finishedAt: "2025-10-08T05:19:02Z"
        reason: OOMKilled
        startedAt: "2025-10-08T05:18:56Z"
    name: secondary-scheduler
    ready: true
    restartCount: 262
    started: true
    state:
      running:
        startedAt: "2025-10-08T05:19:44Z"

$ oc get deploy -n openshift-numaresources secondary-scheduler -o jsonpath='{.spec.template.spec.containers[].resources}' | jq
{
  "limits": {
    "cpu": "600m",
    "memory": "1200Mi"
  },
  "requests": {
    "cpu": "600m",
    "memory": "1200Mi"
  }
}
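For reference, a quick way to confirm that the restarts across the namespace are OOM-related (a minimal sketch, assuming the default openshift-numaresources namespace used above):

$ oc get pods -n openshift-numaresources \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

This prints pod name, restart count, and last termination reason; the secondary-scheduler pod showing a high restart count with reason OOMKilled matches the behavior reported here.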
Version-Release number of selected component (if applicable):
numaresources-operator.v4.16.3
How reproducible:
Observable on a large cluster; the affected cluster has 90 nodes (3 master + 18 storage + 69 worker).
Steps to Reproduce:
1.
2.
3.
Actual results:
The secondary-scheduler pod gets OOMKilled and keeps restarting (262 restarts in the output above).
Expected results:
The secondary-scheduler pod should run stably without being OOMKilled.
Additional info:
It would be a good idea to make this pod's resource requests and limits tunable, perhaps through the NUMAResourcesScheduler CR, so that memory can be scaled up on large clusters; see the sketch below.
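A minimal sketch of what such tuning could look like, assuming a hypothetical schedulerResources field in the NUMAResourcesScheduler spec (this field does not exist in the current API; the field name and values below are illustrative only):

apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9:v4.16
  # Hypothetical field: the operator would propagate these values to the
  # secondary-scheduler deployment's container resources.
  schedulerResources:
    requests:
      cpu: 600m
      memory: 2400Mi
    limits:
      cpu: 600m
      memory: 2400Mi

The operator would then reconcile the secondary-scheduler deployment with the requested values instead of the fixed 600m / 1200Mi shown in the description.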