Description of problem:
On a Single Node OpenShift (SNO) +1 (worker node) setup, the `KubeMemoryOvercommit` alert fires after restarting cri-o.
Version-Release number of selected component (if applicable):
4.14.22
How reproducible:
Set up an SNO+1 cluster and install applications, then restart cri-o.
Steps to Reproduce:
1. Setup SNO+1 cluster
2. Deploy applications
3. Check for alerts (should be none)
4. Restart the cri-o service
5. Observe that the `KubeMemoryOvercommit` alert is triggered
Actual results:
"startsAt": "2024-02-12T11:36:12.933Z",
"endsAt": "2024-02-13T04:01:12.933Z",
"generatorURL": https://console-openshift-console.apps.example.net/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0\u0026g0.tab=1,
"status": {
"state": "suppressed",
"silencedBy": ["04274c0c-1e7d-61ce-9d62-579582cb5fea"],
"inhibitedBy": null
},
"receivers": ["Default"],
"fingerprint": "72ef0aaed2527c32"
}, {
"labels": {
"alertname": "KubeMemoryOvercommit",
"namespace": "kube-system",
"openshift_io_alert_source": "platform",
"prometheus": "openshift-monitoring/k8s",
"severity": "warning"
},
"annotations": {
"description": "Cluster has overcommitted memory resource requests for Pods by 35.85G bytes and cannot tolerate node failure.",
"summary": "Cluster has overcommitted memory resource requests."
Expected results:
No alert triggered upon cri-o restart
Additional info:
For SNO+1 clusters, the alert query should be changed in this part:
(sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster))
This piece works on SNO clusters because sum(all nodes' memory) - max(all nodes' memory) is always 0 there (the sum and the max are the same value). For SNO+1, an additional check could be added, for example:
count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) > 2
or similar, to take the node count into account and avoid triggering the alert in SNO+1 cases.
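The node-count guard described above could be sketched as follows. This is a hypothetical, untested PromQL fragment that combines the expression from this report with the proposed check; the `> 2` threshold and the `and` join are assumptions, not the verified upstream rule:

# Hypothetical sketch: only keep the failover-headroom term on clusters
# with more than two nodes, so SNO and SNO+1 clusters do not alert.
(
  sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
  -
  max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
)
and
count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) > 2

Here `and` acts as a set filter: the left-hand headroom value survives only for clusters where the node count exceeds 2, so the surrounding alert expression would produce no result on SNO+1.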
- clones
  - OCPBUGS-34568 KubeMemoryOvercommit triggered after cri-o restart on SNO+1 (Closed)
- depends on
  - OCPBUGS-35095 `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment (Closed)
- is related to
  - OCPBUGS-62965 [4.19 Backport] `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment (Verified)
  - OCPBUGS-35095 `KubeCPUOvercommit` Alert Not Triggered Despite Node CPU is Overcommitment (Closed)
- links to (3 links)