Bug | Resolution: Done | Normal | 4.16 | Quality / Stability / Reliability | Important | CNF Compute Sprint 279
Description of problem:
The secondary-scheduler pod gets OOMKilled on large clusters:
$ oc get po secondary-scheduler-7d9bf885fc-xqn2z -o yaml
[...]
containerStatuses:
- containerID: cri-o://e2675fbeef0f7f3993bbb2ca63d815d19538434bc2c80e61e2c49d12ba894b0a
  image: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
  imageID: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
  lastState:
    terminated:
      containerID: cri-o://87481f8a582a7360d4b370975f397558499c185a5b00d4c34ded03ed1f702539
      exitCode: 137
      finishedAt: "2025-10-08T05:19:02Z"
      reason: OOMKilled
      startedAt: "2025-10-08T05:18:56Z"
  name: secondary-scheduler
  ready: true
  restartCount: 262
  started: true
  state:
    running:
      startedAt: "2025-10-08T05:19:44Z"
$ oc get deploy -n openshift-numaresources secondary-scheduler -o jsonpath='{.spec.template.spec.containers[].resources}' | jq
{
  "limits": {
    "cpu": "600m",
    "memory": "1200Mi"
  },
  "requests": {
    "cpu": "600m",
    "memory": "1200Mi"
  }
}
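To see how close the scheduler gets to the 1200Mi limit before it is killed, live usage can be checked with oc adm top (assumes cluster metrics are available; pod name taken from the output above):
$ oc adm top pod -n openshift-numaresources secondary-scheduler-7d9bf885fc-xqn2z --containers
If the reported memory climbs toward 1200Mi shortly after start, the current limit is simply too small for a cluster of this size.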
Version-Release number of selected component (if applicable):
numaresources-operator.v4.16.3
How reproducible:
Observable on a large cluster; the affected cluster has 90 nodes: 3 master + 18 storage + 69 worker.
Steps to Reproduce:
1. Install numaresources-operator (v4.16.3) and create a NUMAResourcesScheduler on a large cluster (the affected one has 90 nodes).
2. Wait for the secondary-scheduler pod in the openshift-numaresources namespace to start serving the cluster.
3. Watch the pod status and restart count over time.
Actual results:
The secondary-scheduler pod gets OOMKilled repeatedly (restartCount: 262 in the output above).
Expected results:
The secondary-scheduler pod should run stably, without OOMKills.
Additional info:
It would be a good idea to introduce the ability to tune the limits and requests of this pod, perhaps via the NUMAResourcesScheduler CR.
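A rough sketch of what that tuning could look like (the spec.resources field below is hypothetical and does not exist in the current API; the imageSpec value is illustrative):

apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9:v4.16"
  # hypothetical field: the operator would propagate these values into the
  # secondary-scheduler deployment instead of using hard-coded defaults
  resources:
    requests:
      cpu: "600m"
      memory: "2Gi"
    limits:
      cpu: "600m"
      memory: "2Gi"

Until something like this exists, the values presumably come from the operator-managed deployment, so editing the deployment by hand is likely to be reverted on the next reconcile.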