OpenShift Bugs / OCPBUGS-62341

MCO may hang until timeout when cordoning a node


      Description of problem:

      While the MCO is updating a node, under certain conditions (for example, an unstable network), a failed cordon request can leave the Unschedulable state of the node object held by the MCO out of sync with the node's actual Unschedulable state in the cluster. The MCO then hangs on that node until it eventually times out.
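
      The desync described above can be illustrated with a small toy model. This is only a sketch under an assumption that this report does not confirm: that the cordon helper marks its local node copy as Unschedulable before the request is sent. All names here (FakeCluster, LocalNode, cordon_once, cordon_with_retries) are hypothetical and are not MCO or Kubernetes APIs.

```python
class FakeCluster:
    """Hypothetical stand-in for the API server; fails the first patch calls."""

    def __init__(self, fail_first=1):
        self.unschedulable = False  # the node's actual state in the cluster
        self.fail_first = fail_first
        self.patch_calls = 0

    def patch_unschedulable(self, value):
        self.patch_calls += 1
        if self.patch_calls <= self.fail_first:
            raise ConnectionError("simulated network error")
        self.unschedulable = value


class LocalNode:
    """Hypothetical in-memory node copy held by the controller."""

    def __init__(self, unschedulable=False):
        self.unschedulable = unschedulable


def cordon_once(cluster, node, desired=True):
    # Assumption: the decision to send a request is based on the local copy.
    if node.unschedulable == desired:
        return
    # Assumption: the local copy is mutated BEFORE the request is sent,
    # so a failed request leaves it claiming the node is already cordoned.
    node.unschedulable = desired
    cluster.patch_unschedulable(desired)


def cordon_with_retries(cluster, node, attempts=5):
    for _ in range(attempts):
        try:
            cordon_once(cluster, node)
        except ConnectionError:
            continue
        # Verify against the cluster; the stale local copy never lets this pass.
        if cluster.unschedulable:
            return True
    return False  # models the eventual timeout
```

      In this model, after one failed request every retry skips the patch because the local copy already says Unschedulable, while the verification against the cluster keeps failing until the attempts run out.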

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1. Create a MachineConfig that triggers a node drain and apply it.
          2. When a specific node starts updating, intercept the first cordon patch/update request and force it to fail.
          3. Observe the MCP status and wait for the operation to hang and eventually time out.     

      Actual results:

          The MCO gets stuck on a specific node during the update, repeatedly attempting to cordon it while the node state does not progress.

      Expected results:

          Even if the first cordon attempt hits a network error, subsequent retries should recover and converge on the correct Unschedulable state.
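
          One way to get the recoverable behavior described here is to base every retry on the state read from the cluster rather than on a cached local copy. The following is a minimal sketch of that idea, not the actual fix; FlakyCluster and cordon_with_fresh_state are hypothetical names used only for illustration.

```python
class FlakyCluster:
    """Hypothetical stand-in for the API server; fails the first patch calls."""

    def __init__(self, fail_first=1):
        self.unschedulable = False
        self.fail_first = fail_first
        self.patch_calls = 0

    def get_unschedulable(self):
        # Reads are assumed to succeed in this toy model.
        return self.unschedulable

    def patch_unschedulable(self, value):
        self.patch_calls += 1
        if self.patch_calls <= self.fail_first:
            raise ConnectionError("simulated network error")
        self.unschedulable = value


def cordon_with_fresh_state(cluster, attempts=5):
    for _ in range(attempts):
        # Decide from the cluster's current state, never from a cached copy.
        if cluster.get_unschedulable():
            return True
        try:
            cluster.patch_unschedulable(True)
        except ConnectionError:
            continue  # nothing local was mutated, so the next attempt retries
    return cluster.get_unschedulable()
```

          With a single failed request, the second attempt sends the patch again and the third attempt observes the node as Unschedulable, so the loop converges instead of spinning until timeout.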

      Additional info:

          

              cz21ok@gmail.com Chi Zhang