-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.14.z
-
None
-
Moderate
-
No
-
False
-
-
Description of problem:
On a Single Node OpenShift (SNO) +1 setup (one SNO node plus one worker node), the `KubeMemoryOvercommit` alert fires after cri-o is restarted.
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Set up an SNO+1 cluster and install applications, then restart cri-o.
Steps to Reproduce:
1. Set up an SNO+1 cluster.
2. Deploy applications.
3. Check for alerts (there should be none).
4. Restart the cri-o service.
5. Observe that the `KubeMemoryOvercommit` alert is triggered.
Actual results:
    "startsAt": "2024-02-12T11:36:12.933Z",
    "endsAt": "2024-02-13T04:01:12.933Z",
    "generatorURL": "https://console-openshift-console.apps.example.net/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0\u0026g0.tab=1",
    "status": {
      "state": "suppressed",
      "silencedBy": ["04274c0c-1e7d-61ce-9d62-579582cb5fea"],
      "inhibitedBy": null
    },
    "receivers": ["Default"],
    "fingerprint": "72ef0aaed2527c32"
  },
  {
    "labels": {
      "alertname": "KubeMemoryOvercommit",
      "namespace": "kube-system",
      "openshift_io_alert_source": "platform",
      "prometheus": "openshift-monitoring/k8s",
      "severity": "warning"
    },
    "annotations": {
      "description": "Cluster has overcommitted memory resource requests for Pods by 35.85G bytes and cannot tolerate node failure.",
      "summary": "Cluster has overcommitted memory resource requests."
Expected results:
No alert triggered upon cri-o restart
Additional info:
For SNO+1 clusters, the alert query should be changed in this part:
`(sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster))`
That piece works on SNO clusters because SUM(memory of all nodes) - MAX(memory of all nodes) is always 0 there (the sum and the max are the same value). On SNO+1 the same expression evaluates to the allocatable memory of the smaller node, which is the memory that would remain if the larger node failed. For SNO+1 there could be an additional check, such as `count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) > 2` or similar, to take the node count into account and avoid triggering the alert in SNO+1 use cases.
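The arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration, not the shipped alert rule: the firing condition (requests exceed the sum-minus-max headroom, and that headroom is greater than 0) is an assumption based on the description above, the function names and the `min_nodes` parameter are hypothetical, and the memory figures are made-up GiB values.

```python
def headroom(allocatable):
    """Mirror sum(allocatable) - max(allocatable): memory left if the largest node fails."""
    return sum(allocatable) - max(allocatable)


def overcommit_fires(requested, allocatable, min_nodes=0):
    """Assumed firing condition, with an optional node-count guard.

    min_nodes is a hypothetical stand-in for the proposed `count(...) > 2` check:
    the alert is only evaluated when the cluster has more than min_nodes nodes.
    """
    spare = headroom(allocatable)
    return requested - spare > 0 and spare > 0 and len(allocatable) > min_nodes


# SNO: sum and max are the same value, so the headroom is 0 and nothing fires.
sno = overcommit_fires(requested=40, allocatable=[64])

# SNO+1: the headroom equals the smaller node, so 40 GiB of requests exceeds 32 GiB.
sno_plus_one = overcommit_fires(requested=40, allocatable=[64, 32])

# SNO+1 with the proposed guard: two nodes do not pass the "> 2" check.
sno_plus_one_guarded = overcommit_fires(requested=40, allocatable=[64, 32], min_nodes=2)

# A three-node cluster still alerts when requests exceed the surviving capacity.
three_nodes_guarded = overcommit_fires(requested=140, allocatable=[64, 64, 64], min_nodes=2)

print(sno, sno_plus_one, sno_plus_one_guarded, three_nodes_guarded)
```

With the guard in place, the two-node case stops alerting while clusters of three or more nodes keep the existing behaviour.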
- is related to
-
OCPBUGS-35095 `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment
- POST