- Bug
- Resolution: Unresolved
- Normal
- 4.18
- Quality / Stability / Reliability
- MCO Sprint 277
- In Progress
- Bug Fix
This is a clone of issue OCPBUGS-61516. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-60537. The following is the description of the original issue:
—
Description of problem
In a 4.18.16 cluster, the MCCDrainError alert was firing for multiple Nodes:
$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | select(.state == "firing" and (.labels.alertname | startswith("ClusterOperator") or startswith("MCC"))) | .activeAt + " " + .labels.severity + " " + .labels.alertname + " " + .labels.reason + " " + .labels.exported_node] | sort[]'
2025-08-11T18:41:47.704871087Z warning ClusterOperatorDegraded RequiredPoolsFailed
2025-08-11T18:47:17.704871087Z warning ClusterOperatorDegraded ClusterOperatorDegraded
2025-08-12T05:44:58.84453888Z warning MCCDrainError build04-g4f6n-ci-prowjobs-worker-b-fvwmw
2025-08-12T05:44:58.84453888Z warning MCCDrainError build04-g4f6n-ci-prowjobs-worker-b-pwjmp
2025-08-12T05:44:58.84453888Z warning MCCDrainError build04-g4f6n-ci-tests-worker-a-vs7lz
But none of the three were cordoned:
$ oc get -o json --show-managed-fields node build04-g4f6n-ci-prowjobs-worker-b-fvwmw build04-g4f6n-ci-prowjobs-worker-b-pwjmp build04-g4f6n-ci-tests-worker-a-vs7lz | grep -c unschedulable
0
which led to the drain failures: the machine-config controller kept evicting Pods from those Nodes, but the scheduler placed new Pods on them faster than the controller could evict them.
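For reference only (this was not attempted on the cluster in question), manually re-cordoning an affected Node sets spec.unschedulable back to true, which stops the scheduler from placing new Pods there while the eviction loop catches up. The node name below is just one of the affected Nodes from this bug:
# Check whether the Node is cordoned; prints nothing when spec.unschedulable is unset:
$ oc get node build04-g4f6n-ci-prowjobs-worker-b-fvwmw -o jsonpath='{.spec.unschedulable}{"\n"}'
# Cordon it by hand so the scheduler stops adding Pods while the MCC drains:
$ oc adm cordon build04-g4f6n-ci-prowjobs-worker-b-fvwmw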
Version-Release number of selected component
Seen in a 4.18.16 cluster.
How reproducible
Not clear. And sadly, the MCC logs in this cluster are way too young to cover the original cordoning of any of the affected Nodes:
$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller -c machine-config-controller --tail -1 | head -n1
I0814 15:08:25.279721 1 drain_controller.go:153] evicting pod ci/16cf7b4a-4020-4465-9710-c2fb73a55a56
despite the fact that the MCC container is old enough:
$ oc -n openshift-machine-config-operator get -o json -l k8s-app=machine-config-controller pod | jq -c '.items[].status.containerStatuses[] | select(.name == "machine-config-controller").state'
{"running":{"startedAt":"2025-08-11T17:41:15Z"}}
Steps to Reproduce
Unclear.
Actual results
The machine-config controller complains with MCCDrainError while trying to drain a Node that isn't cordoned/unschedulable.
Expected results
Seems like there would be lots of options, including:
- Don't try to drain uncordoned Nodes, just complain about them being uncordoned.
- Actively manage cordons while trying to drain, including stomping back to unschedulable: true if any other actor tries to uncordon a Node we're trying to drain (sketched below).
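A rough sketch of the second option, written as an out-of-cluster shell watchdog rather than an actual machine-config controller change; the node name and the 30-second interval are purely illustrative:
NODE=build04-g4f6n-ci-prowjobs-worker-b-fvwmw   # illustrative; any Node currently being drained
while true; do
  # Re-assert the cordon if some other actor has flipped the Node back to schedulable.
  if [ "$(oc get node "$NODE" -o jsonpath='{.spec.unschedulable}')" != "true" ]; then
    oc adm cordon "$NODE"
  fi
  sleep 30   # illustrative poll interval
done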
- blocks: OCPBUGS-62637 Machine-config controller should actively manage cordon while draining (Verified)
- clones: OCPBUGS-61516 Machine-config controller should actively manage cordon while draining (Verified)
- is blocked by: OCPBUGS-61516 Machine-config controller should actively manage cordon while draining (Verified)
- is cloned by: OCPBUGS-62637 Machine-config controller should actively manage cordon while draining (Verified)
- links to