OpenShift Bugs / OCPBUGS-61516

Machine-config controller should actively manage cordon while draining


    • Quality / Stability / Reliability
    • MCO Sprint 276, MCO Sprint 277
    • Done
    • Bug Fix
      Previously, an external actor could uncordon a node that the MCO was draining. As a consequence, the MCO and the scheduler would schedule and unschedule pods at the same time, prolonging the drain process. With this fix, the MCO attempts to re-cordon the node if an external actor uncordons it during the drain process. As a result, the MCO and the scheduler no longer schedule and remove pods at the same time. (link:https://issues.redhat.com/browse/OCPBUGS-61516[OCPBUGS-61516])
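
      With the fix in place, one way to watch for the re-cordon is to watch the node's schedulability flag flip back to true after an external uncordon (a hypothetical spot check; the node name is a placeholder):

      $ oc get node <node-name> --watch -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable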

      This is a clone of issue OCPBUGS-60537. The following is the description of the original issue:

      Description of problem

      In a 4.18.16 cluster, MCCDrainError alerts were firing for multiple Nodes:

      $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | select(.state == "firing" and (.labels.alertname | startswith("ClusterOperator") or startswith("MCC"))) | .activeAt + " " + .labels.severity + " " + .labels.alertname + " " + .labels.reason + " " + .labels.exported_node] | sort[]'
      2025-08-11T18:41:47.704871087Z warning ClusterOperatorDegraded RequiredPoolsFailed 
      2025-08-11T18:47:17.704871087Z warning ClusterOperatorDegraded ClusterOperatorDegraded 
      2025-08-12T05:44:58.84453888Z warning MCCDrainError  build04-g4f6n-ci-prowjobs-worker-b-fvwmw
      2025-08-12T05:44:58.84453888Z warning MCCDrainError  build04-g4f6n-ci-prowjobs-worker-b-pwjmp
      2025-08-12T05:44:58.84453888Z warning MCCDrainError  build04-g4f6n-ci-tests-worker-a-vs7lz
      

      But none of the three were cordoned:

      $ oc get -o json --show-managed-fields node build04-g4f6n-ci-prowjobs-worker-b-fvwmw build04-g4f6n-ci-prowjobs-worker-b-pwjmp build04-g4f6n-ci-tests-worker-a-vs7lz | grep -c unschedulable
      0
      

      which led to the drain issues, as the machine-config controller kept evicting Pods from those Nodes, but not as fast as the scheduler placed new Pods on them.
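
      For a more direct check than grepping the full JSON, spec.unschedulable can be read with a JSONPath query; empty output means the field is unset, i.e. the Node is schedulable:

      $ oc get node build04-g4f6n-ci-prowjobs-worker-b-fvwmw -o jsonpath='{.spec.unschedulable}'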

      Version-Release number of selected component

      Seen in a 4.18.16 cluster.

      How reproducible

      Not clear. And sadly, the MCC logs in this cluster are way too young to say anything about the original cordoning of any of the affected Nodes:

      $ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-controller -c machine-config-controller --tail -1 | head -n1
      I0814 15:08:25.279721       1 drain_controller.go:153] evicting pod ci/16cf7b4a-4020-4465-9710-c2fb73a55a56
      

      despite the fact that the MCC container is old enough:

      $ oc -n openshift-machine-config-operator get -o json -l k8s-app=machine-config-controller pod | jq -c '.items[].status.containerStatuses[] | select(.name == "machine-config-controller").state'
      {"running":{"startedAt":"2025-08-11T17:41:15Z"}}
      

      Steps to Reproduce

      Unclear.

      Actual results

      The machine-config controller complains with MCCDrainError while trying to drain a Node that isn't cordoned/unschedulable.

      Expected results

      Seems like there would be lots of options, including:

      • Don't try to drain uncordoned Nodes, just complain about them being uncordoned.
      • Actively manage cordons while trying to drain, including stomping back to unschedulable: true if any other actor tries to uncordon something we're trying to drain (a rough sketch follows below).
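
      As a rough sketch of the second option, hand-rolled with oc here (the real fix would live in the drain controller's sync loop, so this loop is purely illustrative):

      $ NODE=build04-g4f6n-ci-tests-worker-a-vs7lz
      $ while true; do
      >   if [ "$(oc get node "$NODE" -o jsonpath='{.spec.unschedulable}')" != "true" ]; then
      >     oc adm cordon "$NODE"   # stomp the cordon back if someone uncordoned us
      >   fi
      >   sleep 10
      > done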

              djoshy David Joshy
              trking W. Trevor King
              Sergio Regidor de la Rosa