OCPBUGS-34568

KubeMemoryOvercommit triggered after cri-o restart on SNO+1


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.14.z
    • Component/s: Monitoring
    • Severity: Moderate

      Description of problem:

      On a Single Node OpenShift (SNO) +1 (one additional worker node) setup, the `KubeMemoryOvercommit` alert fires after restarting cri-o.

      Version-Release number of selected component (if applicable):

      4.14.22

      How reproducible:

      Set up an SNO+1 cluster and install applications, then restart cri-o.

      Steps to Reproduce:

          1. Set up an SNO+1 cluster
          2. Deploy applications
          3. Check for alerts (there should be none)
          4. Restart the cri-o service
          5. Observe the `KubeMemoryOvercommit` alert firing (see the query sketch below)
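
      To verify step 5, one can query the synthetic `ALERTS` series that Prometheus exposes for every evaluated alerting rule (a generic check, not part of the original report; run it from the console's Observe -> Metrics page or any PromQL client):

      # Returns a series while KubeMemoryOvercommit is firing; an empty
      # result means the alert is not currently active.
      ALERTS{alertname="KubeMemoryOvercommit", alertstate="firing"}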

      Actual results:

            "startsAt": "2024-02-12T11:36:12.933Z",
                  "endsAt": "2024-02-13T04:01:12.933Z",
                  "generatorURL": https://console-openshift-console.apps.example.net/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0\u0026g0.tab=1,
                  "status": {
                      "state": "suppressed",
                      "silencedBy": ["04274c0c-1e7d-61ce-9d62-579582cb5fea"],
                      "inhibitedBy": null
                  },
                  "receivers": ["Default"],
                  "fingerprint": "72ef0aaed2527c32"
              }, {
                  "labels": {
                      "alertname": "KubeMemoryOvercommit",
                      "namespace": "kube-system",
                      "openshift_io_alert_source": "platform",
                      "prometheus": "openshift-monitoring/k8s",
                      "severity": "warning"
                  },
                  "annotations": {
                      "description": "Cluster  has overcommitted memory resource requests for Pods by 35.85G bytes and cannot tolerate node failure.",
                      "summary": "Cluster has overcommitted memory resource requests."   

      Expected results:

          No alert triggered upon cri-o restart

      Additional info:

      For SNO+1 clusters, the `KubeMemoryOvercommit` alert query should be changed in this part:
      
      (sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) 
      
      That piece works on SNO clusters because sum(all nodes' memory) - max(all nodes' memory) is always 0 when there is only one node (the sum and the max are the same value). On an SNO+1 cluster the difference equals the smaller node's allocatable memory, so the expression no longer cancels out and the overcommit check becomes active.
      
      For SNO+1 there could be an additional check, e.g.
      count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) > 2, or something similar that takes the node count into account so the alert does not trigger in SNO+1 use cases (see the sketch below).
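
      A minimal sketch of what such a guard could look like, assuming the expression keeps the upstream kubernetes-mixin shape (the requests side, `namespace_memory:kube_pod_container_resource_requests:sum`, is an assumption here and may differ from the rule actually shipped in 4.14):

      # Hypothetical variant: the overcommit check only matches when the
      # cluster has more than two nodes, so SNO and SNO+1 are excluded.
      (
        sum(namespace_memory:kube_pod_container_resource_requests:sum{}) by (cluster)
        -
        (
          sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
          -
          max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
        )
      ) > 0
      and on (cluster)
      count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) > 2

      With this guard, clusters with three or more nodes keep today's behavior, while SNO and SNO+1 clusters never match the rule.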
      

              Assignee: Pranshu Srivastava (prasriva@redhat.com)
              Reporter: Daniel Moessner (rhn-support-dmoessner)
              QA Contact: Junqi Zhao
              Votes: 0
              Watchers: 7
