OpenShift Bugs: OCPBUGS-63400

Machine deletion due to missing NodeRef can disrupt workloads


    • Quality / Stability / Reliability
    • Important

      Description: 

      A Machine object can be created successfully, and the underlying node may join the cluster and start running workloads. However, if the Node Link Controller fails to set a NodeRef on the Machine, the Machine Health Check (MHC) may mark the Machine as unhealthy and delete it. Because the Machine has no NodeRef, workloads on the node are not drained before deletion, which can cause service disruption.

      This scenario was observed during an AWS outage, but the root cause is environment-agnostic and could occur in other cloud or on-premises setups where node registration is delayed or fails.
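The failure path described above can be illustrated with a small, simplified model. This is only a sketch, not the actual controller code: the `Machine` class, `is_unhealthy`, and `deletion_actions` are hypothetical names, and the startup timeout parameter is an assumption standing in for whatever grace period the MHC applies to Machines that have no node.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    """Hypothetical, minimal stand-in for a Machine object."""
    name: str
    node_ref: Optional[str] = None  # name of the linked Node, None if never linked


def is_unhealthy(machine: Machine, age_seconds: int,
                 startup_timeout_seconds: int, node_ready: bool) -> bool:
    """Simplified health rule (assumption): a Machine without a NodeRef is
    unhealthy once it is older than the startup timeout; a linked Machine is
    unhealthy when its node is not ready."""
    if machine.node_ref is None:
        return age_seconds > startup_timeout_seconds
    return not node_ready


def deletion_actions(machine: Machine) -> List[str]:
    """Ordered actions taken on deletion. The drain step is skipped when
    there is no NodeRef, which is the gap this issue describes."""
    actions = []
    if machine.node_ref is not None:
        actions.append(f"drain node {machine.node_ref}")
    actions.append(f"delete machine {machine.name}")
    return actions
```

In this model, a Machine whose node is actually running workloads but was never linked is flagged once the timeout passes, and its deletion produces no drain step.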

      Version-Release number of selected component (if applicable):

          Observed on 4.16.z; also applies to later versions.

      How reproducible:

      - Occurs in situations where a Machine is created but NodeRef assignment fails.
      - Exact reproducibility depends on timing of Machine creation and node registration.

      Steps to Reproduce:

      1) Create a Machine in an environment where NodeRef assignment is delayed or fails (e.g., cloud outage scenario).
      
      2) Observe the Node Link Controller failing to add a NodeRef.
      
      3) Allow Machine Health Check to detect the Machine as unhealthy.
      
      4) Observe the Machine being deleted without draining workloads on the underlying node.
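To spot affected Machines while reproducing the steps above, one option is to compare Machines and Nodes offline. The sketch below assumes the inputs are the `items` lists from `oc get machines -n openshift-machine-api -o json` and `oc get nodes -o json`, and that a Machine and its Node share the same `spec.providerID`. The function name is hypothetical, and matching on provider ID is an assumption made for illustration; a Machine whose provider ID was never populated will not be detected this way.

```python
def find_unlinked_running_machines(machines, nodes):
    """Return (machine_name, node_name) pairs where the Machine has no
    status.nodeRef but a registered Node carries the same spec.providerID."""
    # Index registered nodes by provider ID, skipping nodes without one.
    node_by_provider_id = {}
    for node in nodes:
        provider_id = node.get("spec", {}).get("providerID")
        if provider_id:
            node_by_provider_id[provider_id] = node["metadata"]["name"]

    suspects = []
    for machine in machines:
        if machine.get("status", {}).get("nodeRef"):
            continue  # already linked, so a drain would happen on deletion
        provider_id = machine.get("spec", {}).get("providerID")
        node_name = node_by_provider_id.get(provider_id) if provider_id else None
        if node_name:
            suspects.append((machine["metadata"]["name"], node_name))
    return suspects
```

Any pair returned is a Machine that would be deleted without a drain even though its node has joined the cluster.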

      Actual results:

      - Machine is deleted despite being partially operational.
      - Workloads on the node continue temporarily but are eventually disrupted.

      Expected results:

      - No workload disruption should occur from this scenario. 

      Additional info:

      • CAPI uses node.cluster.x-k8s.io/uninitialized taint during bootstrap to prevent workloads from scheduling before node initialization (source).
      • MAPI could consider a similar mechanism to avoid workload disruption.
      • Future improvements may include:
        • Updating machine configs to allow additional taints via drop-ins or environment variables.
        • Creating a small bootstrap provider that injects ignition with taints.
        • Ensuring MAPI/CAPI remove temporary taints post-initialization.
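The taint-based idea in the list above could look roughly like the following. This is a sketch under stated assumptions, not an existing MAPI or CAPI function: `remove_uninitialized_taint` is a hypothetical helper, the node is a plain dict shaped like a Node object, and "the Machine is linked" is used as the initialization signal purely for illustration.

```python
UNINITIALIZED_TAINT_KEY = "node.cluster.x-k8s.io/uninitialized"


def remove_uninitialized_taint(node, machine_linked):
    """Drop the temporary taint from the node once its Machine is linked.

    Returns True if the node dict was modified. While the Machine is not
    linked, the taint stays in place, so pods without a matching toleration
    are kept off the node and an undrained deletion would have nothing to
    disrupt.
    """
    if not machine_linked:
        return False
    taints = node.get("spec", {}).get("taints", [])
    remaining = [t for t in taints if t.get("key") != UNINITIALIZED_TAINT_KEY]
    if len(remaining) == len(taints):
        return False  # taint not present, nothing to do
    node["spec"]["taints"] = remaining
    return True
```

The helper only removes the one temporary taint and leaves any other taints on the node untouched.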

              rhn-gps-mbooth Matthew Booth
              cbusse.openshift Claudio Busse
              Zhaohua Sun Zhaohua Sun