OpenShift Bugs / OCPBUGS-31354

[release-4.14] Misleading alert regarding high control plane CPU utilization in Single Node OpenShift (SNO) cluster

    • OCPEDGE Sprint 253, OCPEDGE Sprint 254, OCPEDGE Sprint 255, OCPEDGE Sprint 256
    • Release note text:
      Previously, the `HighOverallControlPlaneCPU` alert triggered warnings based on criteria for multi-node clusters with high availability. As a result, misleading alerts were triggered in {sno} clusters because the configuration did not match the environment criteria. This update refines the alert logic to use {sno}-specific queries and thresholds and account for workload partitioning settings. As a result, CPU utilization alerts in {sno} clusters are accurate and relevant to single-node configurations. (link:https://issues.redhat.com/browse/OCPBUGS-31354[*OCPBUGS-31354*])
    • Done

      The monitoring system in a Single Node OpenShift (SNO) cluster triggers an alert named "HighOverallControlPlaneCPU" about excessive control plane CPU utilization. However, this alert is misleading because it assumes a multi-node setup with high availability (HA) considerations, which do not apply to an SNO deployment.

       
      The customer is receiving multi-node (MNO) alerts in the SNO cluster. Details are below.

      The vDU with a 2xRINLINE card is installed on the SNO node with OCP 4.14.14.
      Hardware: Airframe OE22 2U server with an Intel(R) Xeon(R) Gold 6428N SPR-SP S3 CPU (32 cores, 64 threads) and 128 GB of memory.

      A few minutes after all vDU pods were running, the following alert fired:
       
        "labels":

      {    "alertname": "HighOverallControlPlaneCPU",    "namespace": "openshift-kube-apiserver",    "openshift_io_alert_source": "platform",    "prometheus": "openshift-monitoring/k8s",    "severity": "warning"    }

      ,
         "annotations": {
         "description": "Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. 
      This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. 
      If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. 
      To fix this, increase the CPU and memory on your control plane nodes.",
         "runbook_url": https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md,
         "summary": "CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; 
         a single control plane node outage may cause a cascading failure; increase available CPU."
       
      The alert description is misleading because this cluster is SNO; there is no HA in this cluster.
      Increasing CPU capacity in an SNO cluster is not an option.
      Although the CPU usage is high, this alert is not correct.
      MNO and SNO clusters should have separate alert descriptions.
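
      For illustration only, a rough sketch of what a separate SNO-specific rule could look like. This is a hypothetical example, not the fix that was merged; the 90% threshold, the wording, and the single-control-plane-node guard are assumptions:

        # Hypothetical sketch only; the threshold, wording, and the "== 1" guard are assumptions.
        - alert: HighOverallControlPlaneCPU
          annotations:
            description: CPU utilization on the single control plane node is sustained
              above the threshold. There is no HA headroom to preserve in single-node
              topology; investigate workload CPU usage or the workload partitioning
              configuration instead of adding control plane nodes.
            summary: Sustained high CPU utilization on the single-node control plane.
          expr: |
            (
              sum(
                100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
                AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
              )
              / count(kube_node_role{role="master"})
            ) > 90
            and on() (count(kube_node_role{role="master"}) == 1)
          for: 10m
          labels:
            namespace: openshift-kube-apiserver
            severity: warning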
       


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.33 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:4479


            Ke Wang added a comment -

            The related 4.16 bug https://issues.redhat.com/browse/OCPBUGS-35831 has been verified; we can continue with 4.15 and 4.14.
            Pre-merge verified. Refer to the steps in https://issues.redhat.com/browse/OCPBUGS-31354?focusedId=24636314&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24636314

            $ oc get clusterversion
            NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
            version 4.14.0-0.ci.test-2024-07-05-043558-ci-ln-mwbj2yt-latest True False 11m Cluster version is 4.14.0-0.ci.test-2024-07-05-043558-ci-ln-mwbj2yt-latest

            $ oc get no
            NAME STATUS ROLES AGE VERSION
            ip-10-0-58-51.us-west-1.compute.internal Ready control-plane,master,worker 28m v1.27.14+95b99ee

            Alerts HighOverallControlPlaneCPU and ExtremelyHighIndividualControlPlaneCPU are fired as expected.


            Bulat Zamalutdinov added a comment -

            Hi rhn-support-dmoessner
            Yes, the PR got merged. I need to backport it from 4.16 to 4.14. I'll try to push people to approve the backports ASAP.


            Bulat Zamalutdinov added a comment -

            Hey rhn-support-dmoessner
            I've closed the backport PR because the original PR got reverted.
            You can track the progress at GH LINK. There are hot debates in it.


            Bulat Zamalutdinov added a comment -

            Hi rhn-support-dmoessner. I'm waiting for the last labels to be applied to the PR so I can merge it.

            Yes, when workload partitioning is enabled, we take it into account to adjust the alert threshold. When it's not enabled, we provide default values.
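
            As a rough illustration of that check (a sketch assuming the Infrastructure resource exposes a cpuPartitioning status field and that the rule keeps the group/rule layout shown later in this issue), one way to confirm whether workload partitioning is enabled and to inspect the rendered threshold could be:

            # Sketch only: confirm whether workload partitioning (CPU partitioning) is
            # enabled, then inspect the expression that was rendered into the alert rule.
            # The jsonpath below assumes the rule layout quoted later in this issue.
            $ oc get infrastructure cluster -o jsonpath='{.status.cpuPartitioning}'
            $ oc -n openshift-kube-apiserver get prometheusrules cpu-utilization \
                -o jsonpath='{.spec.groups[0].rules[0].expr}'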


            Junqi Zhao added a comment -

            It is a kube-apiserver issue; someone from apiserver-qe-team should verify it.

            $ oc -n openshift-kube-apiserver get prometheusrules cpu-utilization -oyaml 
            ..
            spec:
              groups:
              - name: control-plane-cpu-utilization
                rules:
                - alert: HighOverallControlPlaneCPU
                  annotations:
                    description: Given three control plane nodes, the overall CPU utilization
                      may only be about 2/3 of all available capacity. This is because if a single
                      control plane node fails, the remaining two must handle the load of the
                      cluster in order to be HA. If the cluster is using more than 2/3 of all
                      capacity, if one control plane node fails, the remaining two are likely
                      to fail when they take the load. To fix this, increase the CPU and memory
                      on your control plane nodes.
                    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md
                    summary: CPU utilization across all three control plane nodes is higher than
                      two control plane nodes can sustain; a single control plane node outage
                      may cause a cascading failure; increase available CPU.
                  expr: |
                    sum(
                      100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
                      AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
                    )
                    /
                    count(kube_node_role{role="master"})
                    > 60
                  for: 10m
                  labels:
                    namespace: openshift-kube-apiserver
                    severity: warning
            

            Assigning to wk2019.


            Bulat Zamalutdinov added a comment -

            juzhao@redhat.com
            I opened a backport PR with the fix there.
            It seems to be working fine; can you verify it as well?


            Simon Pasquier added a comment -

            It seems related to https://issues.redhat.com//browse/OCPBUGS-22117 but I'll let the SNO team triage.


              bzamalut@redhat.com Bulat Zamalutdinov
              rhn-support-bhab Bharathi B
              Ke Wang Ke Wang
              Alexandra Molnar Alexandra Molnar
              Votes: 0
              Watchers: 14
