[OCPBUGS-29713] Excessive node status updates causing high control plane CPU - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.13, 4.12, 4.11, 4.14, 4.15, 4.16
Component/s: Machine Config Operator
Labels:

Test Coverage:

Regression:
No
Sprint:
MCO Sprint 249, MCO Sprint 250
sprint_count:
2
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None

Release Note Text:

* Previously, the default value of the `nodeStatusUpdateFrequency` parameter was changed from `0s` to `10s`. This change inadvertently caused node status reports to be sent far more often, because the `nodeStatusReportFrequency` value was linked to the `nodeStatusUpdateFrequency` value. This resulted in high CPU usage on control plane operators and the API server. This fix manually sets the `nodeStatusReportFrequency` value to `5m`, which prevents this high CPU usage. (link:https://issues.redhat.com/browse/OCPBUGS-29713[*OCPBUGS-29713*])

Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.16.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

OCPBUGS-29424 revealed that setting the node status update frequency in the kubelet (introduced with OCPBUGS-15583) causes high control plane CPU usage.

The increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that react to node changes (API server, etcd, PDB guard pod controllers, or any other static-pod-based machinery).

Reverting the code from OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes CPU usage to drop immediately.

Version-Release number of selected component (if applicable):

Versions to which OCPBUGS-15583 was backported: 4.16, 4.15.0, 4.14.8, 4.13.33, and the next 4.12.z (likely 4.12.51).

How reproducible:

always

Steps to Reproduce:

1. Create a cluster that contains the fix for OCPBUGS-15583.
2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); these should show abnormal values for pod/configmap GET requests.
    Alternatively, check that the rate of node status updates has increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])).

Actual results:

The node status updates every 10s, which causes high CPU usage on control plane operators and the apiserver.

Expected results:

The node status should not update that frequently, and control plane CPU usage should drop back to its previous level.

Additional info:

Slack thread with the node team:
https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849

blocks

OCPBUGS-29797 Excessive node status updates causing high control plane CPU

Closed

causes

OCPBUGS-29716 Replace nodelister with master nodelister everywhere

Closed

OCPBUGS-30002 Overall cpu utilization and per pod management CPU utilization higher in 4.15 than 4.14

Closed

is blocked by

MCO-1094 Impact Excessive node status updates causing high control plane CPU

Closed

is caused by

OCPBUGS-15583 MachineConfig rollout after Control-Plane Node(s) CPU and Memory update because of nodeStatusUpdateFrequency being updated

Closed

is cloned by

OCPBUGS-29797 Excessive node status updates causing high control plane CPU

Closed

relates to

OCPBUGS-15583 MachineConfig rollout after Control-Plane Node(s) CPU and Memory update because of nodeStatusUpdateFrequency being updated

Closed

links to

openshift/machine-config-operator#4204: OCPBUGS-29713: set nodeStatusReportFrequency

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update


Assignee:: Charles Doern

Reporter:: Thomas Jungblut

QA Contact:: Sergio Regidor de la Rosa

Votes:: 0

Watchers:: 11

Created:: 2024/02/20 3:35 PM

Updated:: 2024/06/27 11:38 AM

Resolved:: 2024/06/27 11:38 AM
