Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.z
Affects Version/s: 4.13, 4.12, 4.11, 4.14, 4.15, 4.16
Component/s: Machine Config Operator
Labels:
- mco-triaged
- pre-merge-tested

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
No

Target Backport Versions:
None
Target Version:

4.15.0
Release Blocker:
Proposed
Sprint:
MCO Sprint 249, MCO Sprint 250
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

* Previously, the `nodeStatusReportFrequency` was linked to the `nodeStatusUpdateFrequency`. With this release, the `nodeStatusReportFrequency` is set to 5 minutes. (link:https://issues.redhat.com/browse/OCPBUGS-29797[*~~OCPBUGS-29797~~*])

____________________
When the default for nodeStatusUpdateFrequency changed from 0s to 10s in the MCO, we inadvertently caused nodeStatusReportFrequency to increase, because the logic that sets its higher value was linked to nodeStatusUpdateFrequency being 0s. We need to manually set nodeStatusReportFrequency to 5m.

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue ~~OCPBUGS-29713~~. The following is the description of the original issue:
—
Description of problem:

OCPBUGS-29424 revealed that setting the node status update frequency in kubelet (introduced with OCPBUGS-15583) causes high control plane CPU usage.

The reason is that the increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that usually react to node changes (api server, etcd, PDB guard pod controllers, or any other static pod based machinery).

Reverting the code in OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes CPU usage to drop immediately.
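
The interaction between the two settings can be sketched as follows. This is a minimal Python model of the defaulting behaviour as I understand it from the upstream kubelet (an unset report frequency inherits an explicitly set update frequency, otherwise falls back to 5m), not the actual kubelet or MCO code. The function name `apply_defaults` is hypothetical.

```python
from datetime import timedelta
from typing import Tuple

ZERO = timedelta(0)  # a zero duration stands for "not set"


def apply_defaults(update: timedelta = ZERO,
                   report: timedelta = ZERO) -> Tuple[timedelta, timedelta]:
    """Return (nodeStatusUpdateFrequency, nodeStatusReportFrequency) after defaulting.

    Assumed rule: if the report frequency is unset, it becomes 5m only when
    the update frequency is also unset; otherwise it copies the update frequency.
    """
    if report == ZERO:
        report = timedelta(minutes=5) if update == ZERO else update
    if update == ZERO:
        update = timedelta(seconds=10)
    return update, report


# Nothing set: the report frequency stays at 5m.
print(apply_defaults())
# Update frequency set explicitly to 10s: the report frequency follows it.
print(apply_defaults(update=timedelta(seconds=10)))
# Both set explicitly, as this bug proposes: the report frequency is 5m again.
print(apply_defaults(update=timedelta(seconds=10), report=timedelta(minutes=5)))
```

Under this model, setting the values back to 0s restores the 5m report interval, which would be consistent with the immediate CPU drop described above.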

Version-Release number of selected component (if applicable):

any version that OCPBUGS-15583 was backported to, 4.16 down to 4.11 AFAIU

How reproducible:

always

Steps to Reproduce:

1. Create a cluster that contains the fix for OCPBUGS-15583.
2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); these should show abnormal values for pod/configmap GET requests.
    Alternatively, check that the rate of node updates has increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])).
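
To judge whether the node status PATCH rate is abnormal, it may help to estimate the baseline. The sketch below assumes a steady state in which each kubelet sends one status PATCH per report interval and nothing else changes on the node; the node count of 120 and the helper name `expected_patch_rate` are hypothetical.

```python
def expected_patch_rate(node_count: int, report_interval_seconds: float) -> float:
    """Estimate cluster-wide node status PATCH requests per second.

    Assumes every node reports exactly once per interval (steady state).
    """
    if report_interval_seconds <= 0:
        raise ValueError("report interval must be positive")
    return node_count / report_interval_seconds


nodes = 120  # hypothetical cluster size

healthy = expected_patch_rate(nodes, 5 * 60)  # report interval of 5m
affected = expected_patch_rate(nodes, 10)     # report interval of 10s

print(f"5m interval:  {healthy:.2f} PATCH/s")
print(f"10s interval: {affected:.2f} PATCH/s")
```

Because 5m is 30 times longer than 10s, an affected cluster should show roughly 30 times the baseline PATCH rate under these assumptions, regardless of the node count.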

Actual results:

the node status updates every 10s, which causes high CPU usage on control plane operators and apiserver

Expected results:

the node status should not update that frequently, meaning the control plane CPU usage should go down again

Additional info:

slack thread with the node team:
https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849

blocks

OCPBUGS-30225 Excessive node status updates causing high control plane CPU

Closed

clones

OCPBUGS-29713 Excessive node status updates causing high control plane CPU

Closed

is blocked by

OCPBUGS-29713 Excessive node status updates causing high control plane CPU

Closed

is cloned by

OCPBUGS-30225 Excessive node status updates causing high control plane CPU

Closed

links to

openshift/machine-config-operator#4211: [release-4.15] OCPBUGS-29797: set nodeStatusReportFrequency

RHSA-2024:1210 OpenShift Container Platform 4.15.z security update

Assignee: Team MCO

Reporter: OpenShift Prow Bot

Need Info From: None

Contributors: None

QA Contact: Sergio Regidor de la Rosa

Doc Contact: None

Votes: 0

Watchers: 6

Created: 2024/02/21 11:43 PM

Updated: 2025/07/23 5:27 PM

Resolved: 2024/03/13 3:32 PM
