OpenShift Bugs / OCPBUGS-76538

Undesired Fencing and Failure to Recover in 2-Node Cluster (TNF) after a Sequential MachineConfig Update

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: 4.21.0, 4.22.0
    • Component: Two Node Fencing
    • Severity: Important

      Description of problem:

      In a Two-Node Fencing (TNF) configuration, a sequential update triggered by a MachineConfig resulted in the fencing of the already-updated node and a total loss of etcd quorum.
      
      During the sequential update, a node was fenced (rebooted) because of a brief network response delay (approx. 3 seconds, just over Pacemaker's default timeout). While the fencing mechanism itself executed as designed, the cluster failed to recover automatically after the reboot, leading to a total loss of etcd quorum and requiring manual intervention.
      
      The issue occurred after the first node had been updated and rebooted, while the second node was rejoining the cluster after its own update and reboot. At some point while the second node was joining and the cluster was stabilizing, the first node was fenced and rebooted again.
      
      Pacemaker logs reveal that the root cause of the reboot was a keepalive timeout.
      
      * Trigger: The survivor node failed to receive a response from its peer for approximately 3 seconds.
      * Fencing Action: Pacemaker correctly triggered a STONITH (fence_redfish) operation based on the current default timeouts.
      * The Bug: The core issue is twofold:
         * Default Timeouts: The current ~3 s timeout is likely too aggressive for physical/bare-metal nodes, where brief latency spikes during heavy I/O (such as applying a MachineConfig) are common.
         * Recovery Failure: Even though fencing worked, the cluster should have recovered once the node came back up. Instead, it entered a "Split-brain prevention" lock (error 125 / revision mismatch), leaving etcd stopped on both nodes.
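      As a possible mitigation sketch (not a verified fix), the corosync token timeout that governs this detection window could be inspected and raised. The commands below assume the pcs 0.10+ CLI; the 10-second value is illustrative:

```shell
# Show the current cluster configuration, including the totem token
# timeout (in milliseconds) that controls heartbeat-loss detection.
pcs cluster config show

# Raise the token timeout so brief I/O stalls during a MachineConfig
# rollout do not immediately trigger fencing (10000 ms is illustrative).
pcs cluster config update totem token=10000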

      Version-Release number of selected component (if applicable):

          4.21.0 

      How reproducible:

          Not always. Unlike virtualized environments, where the hypervisor abstracts and smooths timing, bare-metal hardware is exposed to raw I/O and CPU scheduling spikes. During heavy operations such as applying a MachineConfig, physical nodes can experience stop-the-world pauses or I/O wait times that exceed the current 3-second threshold.

      Steps to Reproduce:

          1. Deploy a 2-node bare-metal cluster with TNF enabled.
          2. Apply a MachineConfig that requires a reboot (e.g., modifying the SSH banner or a kernel parameter).
          3. Observe the MCO update cycle.
          4. The first node initiates its update and reboots.
          5. After that, the second node initiates its own update process.
          6. Once the second node has rebooted and is about to join the etcd cluster, the first node is shut down and started again (the fencing mechanism is applied).
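      Step 2 can be illustrated with a minimal MachineConfig; the name, role, and banner contents below are illustrative:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-ssh-banner          # illustrative name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/ssh/sshd_banner
          mode: 0644
          overwrite: true
          contents:
            source: data:,Authorized%20access%20only
```

      Applying such a MachineConfig causes the MCO to drain and reboot each node in turn, which is the window in which the fencing was observed.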

      Actual results:

      * Pre-Reboot: Survivor node stops receiving heartbeats for >3s.
      * Reboot: Peer node is hard-reset via Redfish.
      * Post-Reboot: The cluster stays in a "Stopped" state. Pacemaker logs show: error: local revision is older and peer is not starting.
      * Manual Fix: Requires pcs resource cleanup etcd-clone.

      Expected results:

      * Resilience: The system should tolerate minor latencies (e.g., 5-10 seconds) during maintenance operations.
      * Auto-Recovery: If a fencing event occurs, the cluster must be able to re-establish quorum automatically once the fenced node is back online without requiring a manual pcs resource cleanup.

      Additional info:

          Workaround: Manually cleaning up the pcs resources (pcs resource cleanup etcd-clone) recovered the TNF cluster.
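      The manual recovery applied here can be sketched as follows (run on one of the nodes; the status checks are included for context):

```shell
# Inspect the cluster and the failed actions recorded for etcd.
sudo pcs status

# Clear the failure history for the etcd clone resource so Pacemaker
# retries starting etcd on both nodes.
sudo pcs resource cleanup etcd-clone

# Confirm the resources restarted and quorum was re-established.
sudo pcs status resources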
      
      Error messages:
      
      * On worker-00: The error "local revision is older and peer is not starting" indicates that worker-00 recognizes its data is not the most up-to-date and is waiting for worker-01 to start. However, since worker-01 has also failed, the cluster is unable to proceed.
      * On worker-01: The error "podman failed to launch container (error code: 125)" typically signifies a conflict with an existing container, a file locking issue, or that the Podman storage is in an inconsistent state following the reboot.
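      For the exit-code-125 case on worker-01, a hedged diagnostic sketch (the container name "etcd" is an assumption based on the resource name above):

```shell
# List leftover etcd containers that might conflict with a new launch.
sudo podman ps -a --filter name=etcd

# Remove a stale container left over from before the reboot so the
# resource agent can recreate it (destructive; illustrative only).
sudo podman rm -f etcd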

              Vincenzo Mauro (rh-ee-vmauro)
              Alberto Losada (alosadag@redhat.com)
              Douglas Hensel