Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.13, 4.16
Component/s: Machine Config Operator
Labels:
- good-first-mco-issue
- mco-triaged

Severity:
Low
Regression:
No
Sprint:
MCO Sprint 257
sprint_count:
1
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Release Note Text:
Prevent no-operation patches when a node status is updated, from overloading the kube-api server with unnecessary calls
Release Note Type:
Release Note Not Required
Release Note Status:
In Progress
Target Version:

4.17.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem

I haven't traced out the trigger-pathway yet, but 4.13 and 4.16 machine-config controllers seem to react to Node status updates with makeMasterNodeUnSchedulable calls that result in Kubernetes API PATCH calls, even when the patch being requested is empty. This creates unnecessary API volume, loading the control plane and resulting in distracting Kube-API audit log activity.

Version-release number of selected component (if applicable)

Seen in 4.13.34, and reproduced in 4.16 CI builds, so likely all intervening versions. Possibly all versions.

How reproducible

Every time.

Steps to reproduce

mco#4277 reproduces this by reverting ~~OCPBUGS-29713~~ to get frequent Node status updates. And then in presubmit CI, e2e-aws-ovn > Artifacts > ... > gather-extra pod logs :

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4277/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn/1770970949110206464/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-bdcdf554f-ct5hh_machine-config-controller.log | tail

and:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4277/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn/1770970949110206464/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-bdcdf554f-ct5hh_machine-config-controller.log | grep -o 'makeMasterNodeUnSchedulable\|UpdateNodeRetry' | sort | uniq -c

give...

Actual results

I0322 02:10:13.027938       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-34-167.us-east-2.compute.internal
I0322 02:10:13.029671       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-34-167.us-east-2.compute.internal: {}
I0322 02:10:13.669568       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-84-206.us-east-2.compute.internal
I0322 02:10:13.671023       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-84-206.us-east-2.compute.internal: {}
I0322 02:10:21.095260       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-114-0.us-east-2.compute.internal
I0322 02:10:21.098410       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-114-0.us-east-2.compute.internal: {}
I0322 02:10:23.215168       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-34-167.us-east-2.compute.internal
I0322 02:10:23.219672       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-34-167.us-east-2.compute.internal: {}
I0322 02:10:24.049456       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-84-206.us-east-2.compute.internal
I0322 02:10:24.050939       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-84-206.us-east-2.compute.internal: {}

showing frequent, no-op patch attempts and:

   1408 makeMasterNodeUnSchedulable
   1414 UpdateNodeRetry

showing many attempts over the life of the MCC container.

Expected results

No need to PATCH on makeMasterNodeUnSchedulable unless the generated patch content contained more than a no-op patch.

Additional info

setDesiredMachineConfigAnnotation has a "DesiredMachineConfigAnnotationKey already matches what I want" no-op out here

setUpdateInProgressTaint seems to lack a similar guard here , and it should probably grow a check for "NodeUpdateInProgressTaint is already present". Same for removeUpdateInProgressTaint. But the hot loop in response to these Node status updates is makeMasterNodeUnSchedulable calling UpdateNodeRetry, and UpdateNodeRetry also lacks this kind of "no need to PATCH when no changes are requested" logic.

For this bug, we should:

Add that kind of guard somewhere to the makeMasterNodeUnSchedulable stack. I'd personally recommend putting it down at UpdateNodeRetry.
Optionally add similar guards to setUpdateInProgressTaint and removeUpdateInProgressTaint (or port those to use UpdateNodeRetry? If you port, you might want to port setDesiredMachineConfigAnnotation to, just for consistency).
Optionally add logging like I'm floating in mco#4277, so it's easier to understand why the MCC thinks it needs to take externally-visible action, to help debug future "these Kube API server audit logs have the MCC doing surprising stuff..." situations.

links to

openshift/machine-config-operator#4511: OCPBUGS-31298: Prevent no-operation patches from overloading the kube-api server with unnecessary calls

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

Assignee:: Rishabh Saini

Reporter:: W. Trevor King

QA Contact:: Sergio Regidor de la Rosa

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Created:: 2024/03/22 3:32 AM

Updated:: 2024/10/01 5:40 PM

Resolved:: 2024/10/01 5:40 PM

Details

Description

Description of problem

Version-release number of selected component (if applicable)

How reproducible

Steps to reproduce

Actual results

Expected results

Additional info

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates