- Bug
- Resolution: Done-Errata
- Major
- 4.14
- No
- OCPNODE Sprint 241 (Blue)
- 1
- Proposed
- False
Description of problem:
With 120+ node clusters, we are seeing an O(10) larger rate of node patch requests coming from node service accounts. This higher rate of updates is causing issues where "nodes" watchers are terminated, causing a storm of watch requests that increases CPU load on the cluster. What I see is that node resourceVersions are incremented rapidly and in large bursts, and watchers are terminated as a result.
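The watcher terminations can be confirmed straight from the kube-apiserver metrics endpoint. A minimal sketch, assuming cluster-admin access so that `/metrics` is readable via `oc get --raw` (the 30s polling interval is an arbitrary choice):

```
# Poll the kube-apiserver metrics endpoint and print the counter of
# terminated "nodes" watchers; a steadily growing value matches the symptom above.
while true; do
  date
  oc get --raw /metrics | grep 'apiserver_terminated_watchers_total.*resource="nodes"'
  sleep 30
done
```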
Version-Release number of selected component (if applicable):
4.14.0-ec.4, 4.14.0-0.nightly-2023-08-08-222204, 4.13.0-0.nightly-2023-08-10-021434
How reproducible:
Repeatable
Steps to Reproduce:
1. Create a 4.14 cluster with 120 nodes, using an m5.8xlarge control plane and c5.4xlarge workers.
2. Run `oc get nodes -w -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion'`
3. Wait for a big chunk of nodes to be updated and observe the watch terminate.
4. Optionally run `kube-burner ocp node-density-cni --pods-per-node=100` to generate some load.
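To see how bursty the updates per node are, the watch from step 2 can be captured for a fixed window and tallied; a rough sketch (the 10-minute window and the temporary file path are arbitrary choices):

```
# Record node watch events for 10 minutes, then count update events per node.
timeout 600 oc get nodes -w --no-headers \
  -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion' \
  > /tmp/node-watch.log
# Nodes with the highest counts are the ones patching their own object most often.
awk '{print $1}' /tmp/node-watch.log | sort | uniq -c | sort -rn | head
```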
Actual results:
kube-apiserver audit events show >1500 node patch requests from a single node SA in a given time window:

1678 ["system:node:ip-10-0-69-142.us-west-2.compute.internal",null]
1679 ["system:node:ip-10-0-33-131.us-west-2.compute.internal",null]
1709 ["system:node:ip-10-0-41-44.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} starts to increment before the 120-node scaleup is even complete.
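The per-SA counts above can be reproduced from the kube-apiserver audit logs. A hedged sketch: `oc adm node-logs` is one way to pull the audit logs, and the exact jq filter is an assumption to adapt to the local audit policy:

```
# Count patch requests against nodes, grouped by the requesting username
# (node service accounts show up as system:node:<node-name>).
oc adm node-logs --role=master --path=kube-apiserver/audit.log \
  | grep '^{' \
  | jq -r 'select(.objectRef.resource == "nodes" and .verb == "patch") | .user.username' \
  | sort | uniq -c | sort -rn | head -20
```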
Expected results:
Patch request counts over the same time window are more aligned with what we see on the 4.13 2023-08-10 nightly:

57 ["system:node:ip-10-0-247-122.us-west-2.compute.internal",null]
62 ["system:node:ip-10-0-239-217.us-west-2.compute.internal",null]
63 ["system:node:ip-10-0-165-255.us-west-2.compute.internal",null]
64 ["system:node:ip-10-0-136-122.us-west-2.compute.internal",null]

Observe that apiserver_terminated_watchers_total{resource="nodes"} does not increment, and that the rate of mutating node requests levels off after nodes are created.
Additional info:
We suspect these updates coming from nodes could be a response to the MCO controllerconfigs resource being updated every few minutes or more frequently. This is also one of the suspected causes in the OVN-IC investigation of increased kube-apiserver CPU usage.
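If the controllerconfigs theory holds, its update frequency should be visible with the same kind of watch used for nodes above; a quick sketch:

```
# Watch how often the MCO controllerconfig object is updated (resourceVersion churn).
oc get controllerconfigs.machineconfiguration.openshift.io -w \
  -o custom-columns='NAME:.metadata.name,RV:.metadata.resourceVersion'
```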
- causes:
  - API-1648 ocp 4.14 kube-apiserver dramatic increase in rate of terminated node watchers (Closed)
- relates to:
  - API-1648 ocp 4.14 kube-apiserver dramatic increase in rate of terminated node watchers (Closed)
  - OCPBUGS-17777 [OVN-IC] 34% increase in avg kube-apiserver CPU usage in OVN-IC compared to OpenshiftSDN (Closed)
- links to:
  - RHEA-2023:5006 rpm