-
Bug
-
Resolution: Unresolved
-
Normal
-
4.19, 4.20
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
All
-
None
-
Rejected
Description of problem:
While the MCO is updating a node, under certain conditions (e.g., an unstable network), a failed cordon request can leave the Unschedulable state of the node instance held by the MCO out of sync with the node’s actual Unschedulable state in the cluster. The MCO then hangs on that node until it eventually times out.
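The desync described above can be illustrated with a small, self-contained simulation. This is only a sketch of one way such a mismatch could arise, not the MCO's actual code: `Node`, `FakeAPI`, `cordonOnce`, and `cordonWithRetries` are hypothetical names, and the assumption is that the locally held node copy is marked unschedulable before the request is known to have succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for the Kubernetes Node object; only the
// field relevant to cordoning is modelled.
type Node struct {
	Unschedulable bool
}

// FakeAPI is a hypothetical in-memory API server holding the real node state.
type FakeAPI struct {
	node     Node
	failures int // number of upcoming Patch calls that should fail
}

// Patch applies the desired Unschedulable value unless a failure is injected.
func (a *FakeAPI) Patch(desired bool) error {
	if a.failures > 0 {
		a.failures--
		return errors.New("simulated network error")
	}
	a.node.Unschedulable = desired
	return nil
}

// cordonOnce performs one cordon attempt against a locally cached node copy.
// ASSUMPTION: the cached copy is mutated before the request is known to have
// succeeded, so a failed request leaves the cache claiming "cordoned".
func cordonOnce(api *FakeAPI, cached *Node) error {
	if cached.Unschedulable {
		return nil // cache says already cordoned, so no request is sent
	}
	cached.Unschedulable = true
	return api.Patch(true)
}

// cordonWithRetries retries cordonOnce and checks the real state each time.
func cordonWithRetries(api *FakeAPI, cached *Node, attempts int) error {
	for i := 0; i < attempts; i++ {
		if err := cordonOnce(api, cached); err != nil {
			continue // first attempt fails; later attempts become no-ops
		}
		if api.node.Unschedulable {
			return nil
		}
	}
	return errors.New("timed out: node never became unschedulable")
}

func main() {
	api := &FakeAPI{failures: 1} // only the first request fails
	cached := &Node{}
	err := cordonWithRetries(api, cached, 5)
	fmt.Println("cached:", cached.Unschedulable, "actual:", api.node.Unschedulable, "err:", err)
}
```

In this simulation a single failed request is enough: every later retry sees the stale cached value, sends nothing, and the loop runs until it gives up.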
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a MachineConfig that triggers a node drain and apply it.
2. When a specific node starts updating, intercept the first cordon patch/update request and force it to fail.
3. Observe the MCP status and wait for the operation to hang and eventually time out.
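One possible way to force the first cordon request to fail in step 2 is a fault-injecting HTTP transport. The sketch below is an assumption about how this could be done in a test harness, not the reporter's method: `failFirstNodeWrite`, `patchNode`, `runDemo`, and the node name `worker-0` are hypothetical, and a local `httptest` server stands in for the API server. With client-go, a wrapper like this could plausibly be installed through the `WrapTransport` field of `rest.Config`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// failFirstNodeWrite is a hypothetical RoundTripper that fails the first
// PATCH or PUT sent to a node resource and passes everything else through.
type failFirstNodeWrite struct {
	next   http.RoundTripper
	failed atomic.Bool
}

func (f *failFirstNodeWrite) RoundTrip(req *http.Request) (*http.Response, error) {
	isNodeWrite := (req.Method == http.MethodPatch || req.Method == http.MethodPut) &&
		strings.HasPrefix(req.URL.Path, "/api/v1/nodes/")
	if isNodeWrite && f.failed.CompareAndSwap(false, true) {
		if req.Body != nil {
			req.Body.Close() // a RoundTripper must close the body even on error
		}
		return nil, errors.New("injected network error")
	}
	return f.next.RoundTrip(req)
}

// patchNode sends a cordon-style patch for a hypothetical node "worker-0".
func patchNode(client *http.Client, baseURL string) error {
	body := strings.NewReader(`{"spec":{"unschedulable":true}}`)
	req, err := http.NewRequest(http.MethodPatch, baseURL+"/api/v1/nodes/worker-0", body)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// runDemo issues two patches through the fault-injecting transport and
// returns the error from each.
func runDemo() (first, second error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Transport: &failFirstNodeWrite{next: http.DefaultTransport}}
	first = patchNode(client, srv.URL)
	second = patchNode(client, srv.URL)
	return first, second
}

func main() {
	first, second := runDemo()
	fmt.Println("first:", first)
	fmt.Println("second:", second)
}
```

Only the first matching write fails, so any later behavior is attributable to how the caller handles that one error rather than to a persistent outage.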
Actual results:
The MCO gets stuck on a specific node during the update, repeatedly attempting to cordon it while the node state does not progress.
Expected results:
Even if the first cordon attempt hits a network error, retries should recover and converge on the correct Unschedulable state.
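A minimal sketch of a retry loop with the expected behavior, assuming the fix is to re-read the live node state on every attempt instead of trusting a locally held copy. This is illustrative only and not the MCO's actual fix; `Node`, `FakeAPI`, and `cordonRecoverable` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for the Kubernetes Node object.
type Node struct {
	Unschedulable bool
}

// FakeAPI is a hypothetical in-memory API server holding the real node state.
type FakeAPI struct {
	node     Node
	failures int // number of upcoming Patch calls that should fail
}

// Get returns a fresh copy of the live node state.
func (a *FakeAPI) Get() Node { return a.node }

// Patch applies the desired Unschedulable value unless a failure is injected.
func (a *FakeAPI) Patch(desired bool) error {
	if a.failures > 0 {
		a.failures--
		return errors.New("simulated network error")
	}
	a.node.Unschedulable = desired
	return nil
}

// cordonRecoverable re-reads the live state on every attempt, so a failed
// request never leaves behind a stale "already cordoned" belief.
func cordonRecoverable(api *FakeAPI, attempts int) error {
	for i := 0; i < attempts; i++ {
		live := api.Get()
		if !live.Unschedulable {
			if err := api.Patch(true); err != nil {
				continue // nothing was cached, so the next attempt starts clean
			}
			live = api.Get()
		}
		if live.Unschedulable {
			return nil
		}
	}
	return errors.New("timed out: node never became unschedulable")
}

func main() {
	api := &FakeAPI{failures: 1} // only the first request fails
	err := cordonRecoverable(api, 5)
	fmt.Println("actual:", api.Get().Unschedulable, "err:", err)
}
```

With one injected failure the second attempt succeeds, and the loop only reports a timeout when every attempt fails.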
Additional info:
- blocks
  - OCPBUGS-63126 MCO may hang until timeout when cordoning a node (New)
  - OCPBUGS-63127 MCO may hang until timeout when cordoning a node (POST)
- is cloned by
  - OCPBUGS-63126 MCO may hang until timeout when cordoning a node (New)
  - OCPBUGS-63127 MCO may hang until timeout when cordoning a node (POST)
- links to