- Bug
- Resolution: Unresolved
- Normal
- 4.18
- Quality / Stability / Reliability
- MCO Sprint 277
- In Progress
- Bug Fix
This is a clone of issue OCPBUGS-61516. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-60537. The following is the description of the original issue:
—
Description of problem
In a 4.18.16 cluster, the MCCDrainError alert was firing for multiple Nodes:
$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | select(.state == "firing" and (.labels.alertname | startswith("ClusterOperator") or startswith("MCC"))) | .activeAt + " " + .labels.severity + " " + .labels.alertname + " " + .labels.reason + " " + .labels.exported_node] | sort[]'
2025-08-11T18:41:47.704871087Z warning ClusterOperatorDegraded RequiredPoolsFailed
2025-08-11T18:47:17.704871087Z warning ClusterOperatorDegraded ClusterOperatorDegraded
2025-08-12T05:44:58.84453888Z warning MCCDrainError build04-g4f6n-ci-prowjobs-worker-b-fvwmw
2025-08-12T05:44:58.84453888Z warning MCCDrainError build04-g4f6n-ci-prowjobs-worker-b-pwjmp
2025-08-12T05:44:58.84453888Z warning MCCDrainError build04-g4f6n-ci-tests-worker-a-vs7lz
But none of the three were cordoned:
$ oc get -o json --show-managed-fields node build04-g4f6n-ci-prowjobs-worker-b-fvwmw build04-g4f6n-ci-prowjobs-worker-b-pwjmp build04-g4f6n-ci-tests-worker-a-vs7lz | grep -c unschedulable
0
which led to the drain failures: the machine-config controller kept evicting Pods from those Nodes, but the scheduler placed new Pods on them faster than the controller could evict them.
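For reference only (this was not attempted on the cluster in question), manually re-cordoning an affected Node sets spec.unschedulable back to true, which stops the scheduler from placing new Pods there while the eviction loop catches up. The node name below is just one of the affected Nodes from this bug:
# Check whether the Node is cordoned; prints nothing when spec.unschedulable is unset:
$ oc get node build04-g4f6n-ci-prowjobs-worker-b-fvwmw -o jsonpath='{.spec.unschedulable}{"\n"}'
# Cordon it by hand so the scheduler stops adding Pods while the MCC drains:
$ oc adm cordon build04-g4f6n-ci-prowjobs-worker-b-fvwmw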
Version-Release number of selected component
Seen in a 4.18.16 cluster.
How reproducible
Not clear. And sadly, the MCC logs in this cluster are way too young to cover the original cordoning of any of the affected Nodes:
$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller -c machine-config-controller --tail -1 | head -n1
I0814 15:08:25.279721 1 drain_controller.go:153] evicting pod ci/16cf7b4a-4020-4465-9710-c2fb73a55a56
despite the fact that the MCC container is old enough:
$ oc -n openshift-machine-config-operator get -o json -l k8s-app=machine-config-controller pod | jq -c '.items[].status.containerStatuses[] | select(.name == "machine-config-controller").state'
{"running":{"startedAt":"2025-08-11T17:41:15Z"}}
Steps to Reproduce
Unclear.
Actual results
The machine-config controller complains with MCCDrainError while trying to drain a Node that isn't cordoned/unschedulable.
Expected results
Seems like there would be lots of options, including:
- Don't try to drain uncordoned Nodes, just complain about them being uncordoned.
- Actively manage cordons while trying to drain, including stomping back to unschedulable: true if any other actor tries to uncordon a Node we're trying to drain (sketched below).
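A rough sketch of the second option, written as an out-of-cluster shell watchdog rather than an actual machine-config controller change; the node name and the 30-second interval are purely illustrative:
NODE=build04-g4f6n-ci-prowjobs-worker-b-fvwmw   # illustrative; any Node currently being drained
while true; do
  # Re-assert the cordon if some other actor has flipped the Node back to schedulable.
  if [ "$(oc get node "$NODE" -o jsonpath='{.spec.unschedulable}')" != "true" ]; then
    oc adm cordon "$NODE"
  fi
  sleep 30   # illustrative poll interval
done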
- blocks: OCPBUGS-62637 Machine-config controller should actively manage cordon while draining (Verified)
- clones: OCPBUGS-61516 Machine-config controller should actively manage cordon while draining (Verified)
- is blocked by: OCPBUGS-61516 Machine-config controller should actively manage cordon while draining (Verified)
- is cloned by: OCPBUGS-62637 Machine-config controller should actively manage cordon while draining (Verified)
- links to