Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.17.0
Affects Version/s: 4.17
Component/s: Installer / Single Node OpenShift
Labels:
- ocpedge
- sno
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
3
Severity:
Moderate
Regression:
No
Latest Status Summary:
After speaking with the monitoring team, I'll proceed with measuring the CPU usage of the control plane components described in the OpenShift docs.

Target Backport Versions:
None
Target Version:

4.17.0
Release Blocker:
None
Sprint:
None

RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

* Previously, after installing a :sno: cluster, the monitoring system could produce an alert that applied to clusters with multiple nodes. With this update, :sno: clusters only produce monitoring alerts that apply to :sno: clusters. (link:https://issues.redhat.com/browse/OCPBUGS-35833[*~~OCPBUGS-35833~~*])

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

The monitoring system for a Single Node OpenShift (SNO) cluster is triggering an alert named "HighOverallControlPlaneCPU" for excessive control plane CPU utilization. This alert is misleading because it assumes a multi-node setup with high availability (HA) considerations, which do not apply to an SNO deployment.

The customer is receiving multi-node OpenShift (MNO) alerts in the SNO cluster. Details follow:

The vDU with 2xRINLINE card is installed on the SNO node with OCP 4.14.14.
Hardware used: Airframe OE22 2U server, CPU Intel(R) Xeon(R) Gold 6428N SPR-SP S3 (32 cores, 64 threads), with 128 GB memory.

A few minutes after all vDU pods reached the Running state, the following alert was triggered:

"labels": {
  "alertname": "HighOverallControlPlaneCPU",
  "namespace": "openshift-kube-apiserver",
  "openshift_io_alert_source": "platform",
  "prometheus": "openshift-monitoring/k8s",
  "severity": "warning"
},
"annotations": {
  "description": "Given three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. If the cluster is using more than 2/3 of all capacity, if one control plane node fails, the remaining two are likely to fail when they take the load. To fix this, increase the CPU and memory on your control plane nodes.",
  "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md",
  "summary": "CPU utilization across all three control plane nodes is higher than two control plane nodes can sustain; a single control plane node outage may cause a cascading failure; increase available CPU."
}

The alert description is misleading: this cluster is SNO, so there is no HA.
Increasing CPU capacity in an SNO cluster is not an option.
Although the CPU usage is high, the alert as worded is not correct for this cluster.
MNO and SNO clusters should have separate alert descriptions.
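
To make the mismatch concrete, here is a minimal sketch of the arithmetic behind the alert description, not the actual alert rule. The function names `ha_cpu_budget` and `ha_alert_applies` are hypothetical and exist only for illustration; the sketch assumes the load is spread evenly and that surviving nodes must absorb the full load of one failed node.

```python
from fractions import Fraction


def ha_cpu_budget(control_plane_nodes: int) -> Fraction:
    """Fraction of total control plane CPU that can be in use while
    still tolerating the loss of one node (hypothetical helper)."""
    if control_plane_nodes < 1:
        raise ValueError("need at least one control plane node")
    # After one node fails, (n - 1) nodes must carry the whole load,
    # so usage must stay at or below (n - 1) / n of total capacity.
    return Fraction(control_plane_nodes - 1, control_plane_nodes)


def ha_alert_applies(control_plane_nodes: int) -> bool:
    """The 'survive one node failure' reasoning only makes sense
    when at least one node would survive (hypothetical helper)."""
    return ha_cpu_budget(control_plane_nodes) > 0


if __name__ == "__main__":
    for n in (3, 1):
        print(n, ha_cpu_budget(n), ha_alert_applies(n))
```

With three nodes the budget is 2/3, which matches the wording of the alert. With one node the budget is 0, so the HA-based threshold carries no meaning on SNO, which is why a separate SNO description is requested.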

clones

OCPBUGS-35831 [release-4.16] Misleading alert regarding high control plane CPU utilization in Single Node OpenShift (SNO) cluster

Closed

depends on

OCPEDGE-827 Enable Workload Partitioning Metrics for SNO Alerting

Closed

is depended on by

OCPBUGS-35831 [release-4.16] Misleading alert regarding high control plane CPU utilization in Single Node OpenShift (SNO) cluster

Closed

is duplicated by

OCPBUGS-60142 False positive HighOverallControlPlaneCPU and/or ExtremelyHighIndividualControlPlaneCPU alerts on Scheduleable Masters

links to

OCPEDGE-902: add SNO control plane high cpu usage alert

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update


Assignee: Bulat Zamalutdinov

Reporter: Daniel Fröhlich

Contributors: Chad Scribner

QA Contact: Ke Wang

Need Info From: None

Votes: 0

Watchers: 5

Created: 2024/06/19 2:44 PM

Updated: 2025/09/01 8:40 AM

Resolved: 2024/10/01 5:35 PM
