Bug | Resolution: Done | Normal | 4.16 | Quality / Stability / Reliability | Important | CNF Compute Sprint 279
Description of problem:
The secondary-scheduler pod gets OOMKilled on large clusters:
$ oc get po secondary-scheduler-7d9bf885fc-xqn2z -o yaml
[...]
containerStatuses:
- containerID: cri-o://e2675fbeef0f7f3993bbb2ca63d815d19538434bc2c80e61e2c49d12ba894b0a
  image: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
  imageID: registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9@sha256:a633cf3d9ae757c9316eae69b1e41cd27c4c3c114cc979ae973201a4328532ed
  lastState:
    terminated:
      containerID: cri-o://87481f8a582a7360d4b370975f397558499c185a5b00d4c34ded03ed1f702539
      exitCode: 137
      finishedAt: "2025-10-08T05:19:02Z"
      reason: OOMKilled
      startedAt: "2025-10-08T05:18:56Z"
  name: secondary-scheduler
  ready: true
  restartCount: 262
  started: true
  state:
    running:
      startedAt: "2025-10-08T05:19:44Z"
$ oc get deploy -n openshift-numaresources secondary-scheduler -o jsonpath='{.spec.template.spec.containers[].resources}' | jq
{
  "limits": {
    "cpu": "600m",
    "memory": "1200Mi"
  },
  "requests": {
    "cpu": "600m",
    "memory": "1200Mi"
  }
}
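To see how close the scheduler gets to the 1200Mi limit before it is killed, live usage can be checked with oc adm top (assumes cluster metrics are available; pod name taken from the output above):
$ oc adm top pod -n openshift-numaresources secondary-scheduler-7d9bf885fc-xqn2z --containers
If the reported memory climbs toward 1200Mi shortly after start, the current limit is simply too small for a cluster of this size.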
Version-Release number of selected component (if applicable):
numaresources-operator.v4.16.3
How reproducible:
Observable on a large cluster; the affected cluster has 90 nodes: 3 master + 18 storage + 69 worker.
Steps to Reproduce:
1. Install numaresources-operator (v4.16.3) and create a NUMAResourcesScheduler on a large cluster (the affected one has 90 nodes).
2. Wait for the secondary-scheduler pod in the openshift-numaresources namespace to start serving the cluster.
3. Watch the pod status and restart count over time.
Actual results:
The secondary-scheduler pod gets OOMKilled repeatedly (restartCount: 262 in the output above).
Expected results:
The secondary-scheduler pod should run stably, without OOMKills.
Additional info:
It would be a good idea to introduce the ability to tune the limits and requests of this pod, perhaps via the NUMAResourcesScheduler CR.
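A rough sketch of what that tuning could look like (the spec.resources field below is hypothetical and does not exist in the current API; the imageSpec value is illustrative):

apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9:v4.16"
  # hypothetical field: the operator would propagate these values into the
  # secondary-scheduler deployment instead of using hard-coded defaults
  resources:
    requests:
      cpu: "600m"
      memory: "2Gi"
    limits:
      cpu: "600m"
      memory: "2Gi"

Until something like this exists, the values presumably come from the operator-managed deployment, so editing the deployment by hand is likely to be reverted on the next reconcile.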