-
Bug
-
Resolution: Done-Errata
-
Major
-
4.13, 4.12, 4.11, 4.14, 4.15, 4.16
-
No
-
MCO Sprint 249, MCO Sprint 250
-
2
-
Proposed
-
False
-
Bug Fix
-
In Progress
This is a clone of issue OCPBUGS-29713. The following is the description of the original issue:
—
Description of problem:
OCPBUGS-29424 revealed that setting the node status update frequency in kubelet (introduced with OCPBUGS-15583) causes high control plane CPU usage. The increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that react to node changes (API server, etcd, PDB guard pod controllers, or any other static-pod-based machinery). Reverting the code in OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes the CPU usage to drop immediately.
Version-Release number of selected component (if applicable):
any version that OCPBUGS-15583 was backported to, 4.16 down to 4.11 AFAIU
How reproducible:
always
Steps to Reproduce:
1. Create a cluster that contains the fix for OCPBUGS-15583.
2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); they should show abnormal values for pod/configmap GET requests. Alternatively, check that the rate of node updates is increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])).
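The node-update query from step 2 can also be run outside the console through the Prometheus HTTP API. The following is a minimal sketch, not part of the original report: it only builds the request URL for an instant query. The base URL is a hypothetical placeholder, and authentication against the cluster monitoring stack is left out.

```python
from urllib.parse import urlencode

# PromQL expression taken from the reproduction steps above.
NODE_STATUS_PATCH_RATE = (
    'rate(apiserver_request_total{resource="nodes", '
    'subresource="status", verb="PATCH"}[1m])'
)


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of a Prometheus instant query (/api/v1/query)."""
    # urlencode percent-escapes braces, quotes and brackets in the PromQL.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the Prometheus route of the cluster.
    print(build_instant_query_url("https://prometheus.example.invalid",
                                  NODE_STATUS_PATCH_RATE))
```

The printed URL can be fetched with any HTTP client that carries a valid bearer token for the monitoring stack.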
Actual results:
the node status updates every 10s, which causes high CPU usage on control plane operators and apiserver
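To get a feel for the load this implies, here is a rough back-of-the-envelope sketch. It assumes every node sends exactly one status PATCH per interval, which is a simplification. The node count and the 5-minute comparison interval are illustrative values, not taken from the report.

```python
def expected_patch_rate(node_count: int, interval_seconds: float) -> float:
    """Estimate cluster-wide node status PATCH requests per second.

    Assumes one PATCH per node per interval (simplified model).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return node_count / interval_seconds


if __name__ == "__main__":
    nodes = 120  # illustrative cluster size, not from the report
    print(expected_patch_rate(nodes, 10))   # 10s interval observed in this bug
    print(expected_patch_rate(nodes, 300))  # hypothetical 5-minute interval
```

Under this model, the 10s interval produces 30 times as many status updates as a 5-minute interval, and each update can fan out to every controller that watches nodes.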
Expected results:
the node status should not update that frequently, meaning the control plane CPU usage should go down again
Additional info:
slack thread with the node team: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849
- blocks
-
OCPBUGS-30225 Excessive node status updates causing high control plane CPU
- Closed
- clones
-
OCPBUGS-29713 Excessive node status updates causing high control plane CPU
- Closed
- is blocked by
-
OCPBUGS-29713 Excessive node status updates causing high control plane CPU
- Closed
- is cloned by
-
OCPBUGS-30225 Excessive node status updates causing high control plane CPU
- Closed
- links to
-
RHSA-2024:1210 OpenShift Container Platform 4.15.z security update