OpenShift Request For Enhancement
RFE-4714

adjust HighOverallControlPlaneCPU alert for SNO


Details

    • Type: Feature Request
    • Resolution: Done
    • Priority: Normal
    • Fix Version: openshift-4.12
    • Component: kube-apiserver

    Description

      1. Proposed title of this feature request

      Adjust the HighOverallControlPlaneCPU alert for SNO and clarify the recommended overall CPU and memory resource usage for SNO deployments. For deployments with 3 master nodes, the recommendation is 60%, and the HighOverallControlPlaneCPU alert is set accordingly.

      2. What is the nature and description of the request?

      As per our documentation on "Control plane node sizing", for larger clusters we recommend keeping the overall CPU and memory resource usage on the control plane nodes to a maximum of 60%, which is what the HighOverallControlPlaneCPU alert is aligned to. This guideline ensures that the system can handle sudden spikes in resource usage and prevents potential cascading failures, but it is written for clusters with 3 master nodes. The statement from our documentation reads: "To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes."

      However, the Prometheus alert "HighOverallControlPlaneCPU" is triggered based on the same 60% threshold even in SNO deployments, where one might expect a different value:

      $ oc get infrastructure cluster -o jsonpath='{.status.infrastructureTopology}'
      SingleReplica
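
      Because the topology is discoverable this way, tooling around the alert could derive its threshold from it. A minimal sketch, assuming a hypothetical 90% threshold for SNO (the appropriate SNO value is exactly what this RFE asks to be defined):

      # Pick a CPU alert threshold based on the reported infrastructure topology.
      # The 90% figure for SingleReplica is an illustrative assumption only.
      topology=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureTopology}')
      if [ "$topology" = "SingleReplica" ]; then
        threshold=90   # single node: no peer to fail over to, so no HA headroom to reserve
      else
        threshold=60   # keep ~1/3 headroom so two nodes can absorb a failed third
      fi
      echo "HighOverallControlPlaneCPU threshold: ${threshold}%"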

      $ oc get prometheusrules cpu-utilization -n openshift-kube-apiserver -o yaml | grep HighOverallControlPlaneCPU
      ...
      - alert: HighOverallControlPlaneCPU
        annotations:
          description: Given three control plane nodes, the overall CPU utilization
            may only be about 2/3 of all available capacity. This is because if a single
            control plane node fails, the remaining two must handle the load of the
            cluster in order to be HA. If the cluster is using more than 2/3 of all
            capacity, if one control plane node fails, the remaining two are likely
            to fail when they take the load. To fix this, increase the CPU and memory
            on your control plane nodes.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md
          summary: CPU utilization across all three control plane nodes is higher than
            two control plane nodes can sustain; a single control plane node outage
            may cause a cascading failure; increase available CPU.
        expr: |
          sum(
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
            AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
          )
          /
          count(kube_node_role{role="master"})
          > 60
      ...
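
      One way to make the alert topology-aware, sketched below under assumptions: the existing 60% expression is kept but gated on clusters with more than one control plane node, and a separate single-node branch uses an illustrative 90% threshold. The SNO alert name and threshold here are hypothetical, not the shipped rule.

      # Hypothetical topology-aware variant; the alert split and the 90%
      # single-node threshold are illustrative assumptions.
      - alert: HighOverallControlPlaneCPU
        expr: |
          sum(
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
            AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
          )
          /
          count(kube_node_role{role="master"})
          > 60
          and count(kube_node_role{role="master"}) > 1
      - alert: HighOverallControlPlaneCPUSingleNode   # hypothetical name
        expr: |
          sum(
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
            AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
          )
          > 90
          and count(kube_node_role{role="master"}) == 1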

      3. Why does the customer need this? (List the business requirements here)

      As a customer, we require a recommendation for the maximum CPU/memory usage when conducting load testing on SNO, as we have for deployments with 3 master nodes.
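
      For reference during such load tests, the per-node utilization that feeds the alert can be graphed directly, for example in the console under Observe → Metrics; this is the inner part of the rule's own expression, restricted to control plane nodes:

      100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
        AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )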

      4. List any affected packages or components.

      cluster-kube-apiserver-operator

      https://github.com/openshift/cluster-kube-apiserver-operator/blob/a66fe80f9ce087436b7bdd2e05af512892864664/bindata/assets/alerts/cpu-utilization.yaml#L10
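
      The shipped expression and threshold can also be inspected straight from that asset, fetched via the raw endpoint at the same commit as the link above:

      $ curl -s https://raw.githubusercontent.com/openshift/cluster-kube-apiserver-operator/a66fe80f9ce087436b7bdd2e05af512892864664/bindata/assets/alerts/cpu-utilization.yaml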


            People

              Assignee: dfroehli42rh Daniel Fröhlich
              Reporter: rhn-support-jclaretm Jorge Claret Membrado
              Votes: 0
              Watchers: 6
