-
Bug
-
Resolution: Done-Errata
-
Major
-
4.13, 4.12, 4.11, 4.14, 4.15, 4.16
-
+
-
No
-
MCO Sprint 249, MCO Sprint 250
-
2
-
Rejected
-
False
-
Bug Fix
-
Done
Description of problem:
OCPBUGS-29424 revealed that setting the node status update frequency in the kubelet (introduced with OCPBUGS-15583) causes high control plane CPU usage. The increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that react to node changes (API server, etcd, PDB guard pod controllers, or any other static-pod-based machinery). Reverting the code from OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes CPU usage to drop immediately.
Version-Release number of selected component (if applicable):
Versions to which OCPBUGS-15583 was backported. This includes 4.16, 4.15.0, 4.14.8, 4.13.33, and the next 4.12.z (likely 4.12.51).
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster that contains the fix for OCPBUGS-15583.
2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); they should show abnormal values for pod/configmap GET requests. Alternatively, check that the rate of node updates has increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])).
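The two PromQL expressions from the reproduction steps can also be run against the Prometheus HTTP API instead of the console. The following is a minimal sketch that only builds the instant-query URLs; the base URL is a hypothetical placeholder, and authentication (for example a bearer token for the cluster monitoring stack) is deliberately left out.

```python
from urllib.parse import urlencode

# PromQL expressions copied from the reproduction steps.
OVERALL_REQUEST_RATE = "rate(apiserver_request_total[5m])"
NODE_STATUS_PATCH_RATE = (
    'rate(apiserver_request_total{resource="nodes", '
    'subresource="status", verb="PATCH"}[1m])'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (/api/v1/query) for a PromQL expression."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the Prometheus or Thanos route of the cluster.
    base = "https://prometheus.example.invalid"
    for expression in (OVERALL_REQUEST_RATE, NODE_STATUS_PATCH_RATE):
        print(instant_query_url(base, expression))
```

The resulting URLs can be fetched with any HTTP client; the node status PATCH rate is the more direct signal, since it isolates the kubelet updates from other API traffic.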
Actual results:
The node status updates every 10s, which causes high CPU usage in the control plane operators and the apiserver.
Expected results:
The node status should not update that frequently, so control plane CPU usage should drop back to its previous level.
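As a rough back-of-the-envelope check of the difference between actual and expected behavior, the steady-state rate of node status PATCH requests can be estimated as the node count divided by the update interval. The sketch below assumes a hypothetical 120-node cluster and uses 5 minutes as the comparison interval (assumed here to be the kubelet's default report frequency when no update frequency is set explicitly); it ignores jitter and updates triggered by actual status changes.

```python
def status_patches_per_second(node_count: int, interval_seconds: float) -> float:
    """Estimate cluster-wide node status PATCH requests per second.

    Assumes every node sends exactly one status update per interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return node_count / interval_seconds


if __name__ == "__main__":
    nodes = 120  # hypothetical cluster size, not taken from the bug report
    observed = status_patches_per_second(nodes, 10)    # 10s interval from the report
    baseline = status_patches_per_second(nodes, 300)   # assumed 5m baseline
    print(f"10s interval: {observed:.1f} PATCH/s")
    print(f"5m interval:  {baseline:.1f} PATCH/s")
    print(f"ratio: {observed / baseline:.0f}x")
```

Under these assumptions the 10s interval produces about 30 times as many status updates as the 5m baseline, and each update can additionally wake every controller that watches nodes.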
Additional info:
Slack thread with the node team: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849
- blocks
-
OCPBUGS-29797 Excessive node status updates causing high control plane CPU
- Closed
- causes
-
OCPBUGS-29716 Replace nodelister with master nodelister everywhere
- Closed
-
OCPBUGS-30002 Overall cpu utilization and per pod management CPU utilization higher in 4.15 than 4.14
- Closed
- is blocked by
-
MCO-1094 Impact Excessive node status updates causing high control plane CPU
- Closed
- is caused by
-
OCPBUGS-15583 MachineConfig rollout after Control-Plane Node(s) CPU and Memory update because of nodeStatusUpdateFrequency being updated
- Closed
- is cloned by
-
OCPBUGS-29797 Excessive node status updates causing high control plane CPU
- Closed
- relates to
-
OCPBUGS-15583 MachineConfig rollout after Control-Plane Node(s) CPU and Memory update because of nodeStatusUpdateFrequency being updated
- Closed
- links to
-
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update