OCPBUGS-34568

KubeMemoryOvercommit triggered after cri-o restart on SNO+1


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.14.z
    • Component/s: Monitoring
    • Severity: Moderate

      Description of problem:

      On a Single Node OpenShift (SNO) +1 (one additional worker node) setup, the `KubeMemoryOvercommit` alert fires after restarting cri-o.

      Version-Release number of selected component (if applicable):

      4.14.22

      How reproducible:

      Set up an SNO+1 cluster and install applications, then restart cri-o.

      Steps to Reproduce:

          1. Set up an SNO+1 cluster
          2. Deploy applications
          3. Check for alerts (there should be none)
          4. Restart the cri-o service
          5. Observe the `KubeMemoryOvercommit` alert firing (see the query sketch below)
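
      To verify step 5, one can query the synthetic `ALERTS` series that Prometheus exposes for every evaluated alerting rule (a generic check, not part of the original report; run it from the console's Observe -> Metrics page or any PromQL client):

      # Returns a series while KubeMemoryOvercommit is firing; an empty
      # result means the alert is not currently active.
      ALERTS{alertname="KubeMemoryOvercommit", alertstate="firing"}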

      Actual results:

            "startsAt": "2024-02-12T11:36:12.933Z",
                  "endsAt": "2024-02-13T04:01:12.933Z",
                  "generatorURL": https://console-openshift-console.apps.example.net/monitoring/graph?g0.expr=cluster%3Aalertmanager_integrations%3Amax+%3D%3D+0\u0026g0.tab=1,
                  "status": {
                      "state": "suppressed",
                      "silencedBy": ["04274c0c-1e7d-61ce-9d62-579582cb5fea"],
                      "inhibitedBy": null
                  },
                  "receivers": ["Default"],
                  "fingerprint": "72ef0aaed2527c32"
              }, {
                  "labels": {
                      "alertname": "KubeMemoryOvercommit",
                      "namespace": "kube-system",
                      "openshift_io_alert_source": "platform",
                      "prometheus": "openshift-monitoring/k8s",
                      "severity": "warning"
                  },
                  "annotations": {
                      "description": "Cluster  has overcommitted memory resource requests for Pods by 35.85G bytes and cannot tolerate node failure.",
                      "summary": "Cluster has overcommitted memory resource requests."   

      Expected results:

          No alert triggered upon cri-o restart

      Additional info:

      For SNO+1 clusters, the `KubeMemoryOvercommit` alert query should be changed in this part:
      
      (sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) 
      
      That piece works on SNO clusters because sum(all nodes' memory) - max(all nodes' memory) is always 0 when there is only one node (the sum and the max are the same value). On an SNO+1 cluster the difference equals the smaller node's allocatable memory, so the expression no longer cancels out and the overcommit check becomes active.
      
      For SNO+1 there could be an additional check, e.g.
      count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) > 2, or something similar that takes the node count into account so the alert does not trigger in SNO+1 use cases (see the sketch below).
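
      A minimal sketch of what such a guard could look like, assuming the expression keeps the upstream kubernetes-mixin shape (the requests side, `namespace_memory:kube_pod_container_resource_requests:sum`, is an assumption here and may differ from the rule actually shipped in 4.14):

      # Hypothetical variant: the overcommit check only matches when the
      # cluster has more than two nodes, so SNO and SNO+1 are excluded.
      (
        sum(namespace_memory:kube_pod_container_resource_requests:sum{}) by (cluster)
        -
        (
          sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
          -
          max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
        )
      ) > 0
      and on (cluster)
      count(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) > 2

      With this guard, clusters with three or more nodes keep today's behavior, while SNO and SNO+1 clusters never match the rule.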
      

              Assignee: Pranshu Srivastava (prasriva@redhat.com)
              Reporter: Daniel Moessner (rhn-support-dmoessner)
              QA Contact: Junqi Zhao
              Votes: 0
              Watchers: 7
