Description of problem:
On a Single Node OpenShift (SNO) +1 (worker node) setup, the `KubeMemoryOvercommit` alert fires after restarting cri-o.
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Set up an SNO+1 cluster and install applications, then restart cri-o.
Steps to Reproduce:
1. Setup SNO+1 cluster
2. Deploy applications
3. Check for alerts (should be none)
4. Restart the cri-o service
5. Observe that the `KubeMemoryOvercommit` alert is triggered
Actual results:
"startsAt": "2024-02-12T11:36:12.933Z",
"endsAt": "2024-02-13T04:01:12.933Z",
"generatorURL": https://console-openshift-console.apps.example.net/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0\u0026g0.tab=1,
"status": {
"state": "suppressed",
"silencedBy": ["04274c0c-1e7d-61ce-9d62-579582cb5fea"],
"inhibitedBy": null
},
"receivers": ["Default"],
"fingerprint": "72ef0aaed2527c32"
}, {
"labels": {
"alertname": "KubeMemoryOvercommit",
"namespace": "kube-system",
"openshift_io_alert_source": "platform",
"prometheus": "openshift-monitoring/k8s",
"severity": "warning"
},
"annotations": {
"description": "Cluster has overcommitted memory resource requests for Pods by 35.85G bytes and cannot tolerate node failure.",
"summary": "Cluster has overcommitted memory resource requests."
Expected results:
No alert triggered upon cri-o restart
Additional info:
For SNO+1 clusters, the alert query should be changed in this part:
(sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster))
This piece works on SNO clusters because sum(all nodes' memory) - max(all nodes' memory) is always 0 there (the sum and the max are the same value). For SNO+1, an additional check could be added, for example:
count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) > 2
or similar, to take the node count into account and avoid triggering the alert in SNO+1 cases.
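The node-count guard described above could be sketched as follows. This is a hypothetical, untested PromQL fragment that combines the expression from this report with the proposed check; the `> 2` threshold and the `and` join are assumptions, not the verified upstream rule:

# Hypothetical sketch: only keep the failover-headroom term on clusters
# with more than two nodes, so SNO and SNO+1 clusters do not alert.
(
  sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
  -
  max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
)
and
count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) > 2

Here `and` acts as a set filter: the left-hand headroom value survives only for clusters where the node count exceeds 2, so the surrounding alert expression would produce no result on SNO+1.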
- clones
  - OCPBUGS-34568 KubeMemoryOvercommit triggered after cri-o restart on SNO+1 (Closed)
- depends on
  - OCPBUGS-35095 `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment (Closed)
- is related to
  - OCPBUGS-62965 [4.19 Backport] `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment (Verified)
  - OCPBUGS-35095 `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment (Closed)
- links to (3 links)