OCPBUGS-35832

[release-4.15] Misleading alert regarding high control plane CPU utilization in Single Node OpenShift (SNO) cluster

      Previously, the `HighOverallControlPlaneCPU` alert triggered warnings based on criteria for multi-node clusters with high availability. As a result, misleading alerts were triggered in {sno} clusters because the configuration did not match the environment criteria. This update refines the alert logic to use {sno}-specific queries and thresholds and account for workload partitioning settings. As a result, CPU utilization alerts in {sno} clusters are accurate and relevant to single-node configurations. (link:https://issues.redhat.com/browse/OCPBUGS-35832[*OCPBUGS-35832*])
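      As a rough illustration of what such topology-aware alert logic could look like (the rule name, expressions, and thresholds below are assumptions for this sketch, not the exact queries shipped with the fix), the multi-node expression can be gated on the number of control plane nodes, with a separate variant for single-node clusters:

      # Illustrative sketch only, not the shipped rule. Metric names are standard
      # node_exporter and kube-state-metrics series; thresholds are placeholders.
      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: example-control-plane-cpu   # hypothetical name
        namespace: openshift-kube-apiserver
      spec:
        groups:
        - name: control-plane-cpu
          rules:
          # Multi-node form: only evaluates when more than one control plane node exists.
          - alert: HighOverallControlPlaneCPU
            expr: |
              (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 66
                and count(kube_node_role{role="master"}) > 1
            for: 10m
            labels:
              severity: warning
          # Single-node form: SNO-appropriate threshold and wording; a real rule would
          # also account for workload partitioning (the reserved management CPU set).
          - alert: HighOverallControlPlaneCPU
            expr: |
              (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 90
                and count(kube_node_role{role="master"}) == 1
            for: 10m
            labels:
              severity: warning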

      The monitoring system on a Single Node OpenShift (SNO) cluster is triggering an alert named "HighOverallControlPlaneCPU" for excessive control plane CPU utilization. However, this alert is misleading because it assumes a multi-node setup with high availability (HA) considerations, which do not apply to an SNO deployment.

       
      The customer is receiving multi-node (MNO) alerts on the SNO cluster. Details are below:
       
      The vDU with a 2xRINLINE card is installed on the SNO node running OCP 4.14.14.
      Hardware: Airframe OE22 2U server with an Intel(R) Xeon(R) Gold 6428N (SPR-SP S3) CPU (32 cores, 64 threads) and 128 GB of memory.
       
      A few minutes after all vDU pods were running, the following alert fired:
       
        "labels":

      {    "alertname": "HighOverallControlPlaneCPU",    "namespace": "openshift-kube-apiserver",    "openshift_io_alert_source": "platform",    "prometheus": "openshift-monitoring/k8s",    "severity": "warning"    }

      ,
         "annotations": {
         "description": "Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. 
      This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. 
      If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. 
      To fix this, increase the CPU and memory on your control plane nodes.",
         "runbook_url": https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md,
         "summary": "CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; 
         a single control plane node outage may cause a cascading failure; increase available CPU."
       
      The alert description is misleading because this is an SNO cluster, which has no HA.
      Increasing CPU capacity on an SNO cluster is not an option.
      Although CPU usage is high, this alert is not correct for a single-node topology.
      MNO and SNO clusters should have separate alert definitions and descriptions.
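      One quick way to confirm which topology an alert should assume (assuming the default monitoring stack with kube-state-metrics, as in a standard OpenShift install) is to count control plane nodes from Prometheus, for example:

      # Returns 1 on an SNO cluster and 3 on a typical multi-node control plane.
      count(kube_node_role{role="master"})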
       
