- Bug
- Resolution: Done-Errata
- Major
- 4.14
- No
- OCPNODE Sprint 241 (Blue)
- 1
- Proposed
- False
Description of problem:
With 120+ node clusters, we are seeing an O(10) larger rate of node patch requests coming from node service accounts. This higher rate of updates is causing issues where "nodes" watchers are terminated, causing a storm of watch requests that increases CPU load on the cluster. What I see is that node resourceVersions are incremented rapidly and in large bursts, and watchers are terminated as a result.
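The watcher terminations can be confirmed straight from the kube-apiserver metrics endpoint. A minimal sketch, assuming cluster-admin access so that `/metrics` is readable via `oc get --raw` (the 30s polling interval is an arbitrary choice):

```
# Poll the kube-apiserver metrics endpoint and print the counter of
# terminated "nodes" watchers; a steadily growing value matches the symptom above.
while true; do
  date
  oc get --raw /metrics | grep 'apiserver_terminated_watchers_total.*resource="nodes"'
  sleep 30
done
```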
Version-Release number of selected component (if applicable):
4.14.0-ec.4, 4.14.0-0.nightly-2023-08-08-222204, 4.13.0-0.nightly-2023-08-10-021434
How reproducible:
Repeatable
Steps to Reproduce:
1. Create a 4.14 cluster with 120 nodes, using an m5.8xlarge control plane and c5.4xlarge workers.
2. Run `oc get nodes -w -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion'`
3. Wait for a big chunk of nodes to be updated and observe the watch terminate.
4. Optionally run `kube-burner ocp node-density-cni --pods-per-node=100` to generate some load.
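To see how bursty the updates per node are, the watch from step 2 can be captured for a fixed window and tallied; a rough sketch (the 10-minute window and the temporary file path are arbitrary choices):

```
# Record node watch events for 10 minutes, then count update events per node.
timeout 600 oc get nodes -w --no-headers \
  -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion' \
  > /tmp/node-watch.log
# Nodes with the highest counts are the ones patching their own object most often.
awk '{print $1}' /tmp/node-watch.log | sort | uniq -c | sort -rn | head
```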
Actual results:
kube-apiserver audit events show >1500 node patch requests from a single node SA in a given time window:

1678 ["system:node:ip-10-0-69-142.us-west-2.compute.internal",null]
1679 ["system:node:ip-10-0-33-131.us-west-2.compute.internal",null]
1709 ["system:node:ip-10-0-41-44.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} starts to increment before the 120-node scaleup is even complete.
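The per-SA counts above can be reproduced from the kube-apiserver audit logs. A hedged sketch: `oc adm node-logs` is one way to pull the audit logs, and the exact jq filter is an assumption to adapt to the local audit policy:

```
# Count patch requests against nodes, grouped by the requesting username
# (node service accounts show up as system:node:<node-name>).
oc adm node-logs --role=master --path=kube-apiserver/audit.log \
  | grep '^{' \
  | jq -r 'select(.objectRef.resource == "nodes" and .verb == "patch") | .user.username' \
  | sort | uniq -c | sort -rn | head -20
```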
Expected results:
Patch request counts over the same time window are more aligned with what we see on the 4.13 2023-08-10 nightly:

57 ["system:node:ip-10-0-247-122.us-west-2.compute.internal",null]
62 ["system:node:ip-10-0-239-217.us-west-2.compute.internal",null]
63 ["system:node:ip-10-0-165-255.us-west-2.compute.internal",null]
64 ["system:node:ip-10-0-136-122.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} does not increment, and that the rate of mutating node requests levels off after nodes are created.
Additional info:
We suspect these updates coming from nodes could be a response to the MCO controllerconfigs resource being updated every few minutes or more frequently. This is also one of the suspected causes in the OVN-IC investigation of increased kube-apiserver CPU usage.
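If the controllerconfigs theory holds, its update frequency should be visible with the same kind of watch used for nodes above; a quick sketch:

```
# Watch how often the MCO controllerconfig object is updated (resourceVersion churn).
oc get controllerconfigs.machineconfiguration.openshift.io -w \
  -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion'
```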
- causes:
  - API-1648 ocp 4.14 kube-apiserver dramatic increase in rate of terminated node watchers (Closed)
- relates to:
  - API-1648 ocp 4.14 kube-apiserver dramatic increase in rate of terminated node watchers (Closed)
  - OCPBUGS-17777 [OVN-IC] 34% increase in avg kube-apiserver CPU usage in OVN-IC compared to OpenshiftSDN (Closed)
- links to:
  - RHEA-2023:5006 rpm