OpenShift Bugs / OCPBUGS-76538

Undesired Fencing and Failure to Recover in 2-Node Cluster (TNF) after a Sequential MachineConfig Update

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: 4.21.0, 4.22.0
    • Component: Two Node Fencing
    • Severity: Important

      Description of problem:

      In a Two-Node Fencing (TNF) configuration, a sequential update triggered by a MachineConfig resulted in the fencing of the already-updated node and a total loss of etcd quorum.
      
      During the sequential update, a node was fenced (rebooted) because of a brief network response delay (approx. 3 seconds, just over Pacemaker's default timeout). While the fencing mechanism itself executed as designed, the cluster failed to recover automatically after the reboot, leading to a total loss of etcd quorum and requiring manual intervention.
      
      The issue occurred after the first node had been updated and rebooted, while the second node was rejoining the cluster after its own update and reboot. At some point while the second node was joining and the cluster was stabilizing, the first node was fenced and rebooted again.
      
      Pacemaker logs reveal that the root cause of the reboot was a keepalive timeout.
      
      * Trigger: The survivor node failed to receive a response from its peer for approximately 3 seconds.
      * Fencing Action: Pacemaker correctly triggered a STONITH (fence_redfish) operation based on the current default timeouts.
      * The Bug: The core issue is twofold:
         * Default Timeouts: The current ~3 s timeout is likely too aggressive for physical/bare-metal nodes, where brief latency spikes during heavy I/O (such as applying a MachineConfig) are common.
         * Recovery Failure: Even though fencing worked, the cluster should have recovered once the node came back up. Instead, it entered a "Split-brain prevention" lock (error 125 / revision mismatch), leaving etcd stopped on both nodes.
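      As a possible mitigation sketch (not a verified fix), the corosync token timeout that governs this detection window could be inspected and raised. The commands below assume the pcs 0.10+ CLI; the 10-second value is illustrative:

```shell
# Show the current cluster configuration, including the totem token
# timeout (in milliseconds) that controls heartbeat-loss detection.
pcs cluster config show

# Raise the token timeout so brief I/O stalls during a MachineConfig
# rollout do not immediately trigger fencing (10000 ms is illustrative).
pcs cluster config update totem token=10000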

      Version-Release number of selected component (if applicable):

          4.21.0 

      How reproducible:

          Not always. Unlike virtualized environments, where the hypervisor abstracts and smooths timing, bare-metal hardware is exposed to raw I/O and CPU scheduling spikes. During heavy operations such as applying a MachineConfig, physical nodes can experience stop-the-world pauses or I/O wait times that exceed the current 3-second threshold.

      Steps to Reproduce:

          1. Deploy a 2-node bare-metal cluster with TNF enabled.
          2. Apply a MachineConfig that requires a reboot (e.g., modifying the SSH banner or a kernel parameter).
          3. Observe the MCO update cycle.
          4. The first node initiates its update and reboots.
          5. After that, the second node initiates its own update process.
          6. Once the second node has rebooted and is about to join the etcd cluster, the first node is shut down and started again (the fencing mechanism is applied).
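      Step 2 can be illustrated with a minimal MachineConfig; the name, role, and banner contents below are illustrative:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-ssh-banner          # illustrative name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /etc/ssh/sshd_banner
          mode: 0644
          overwrite: true
          contents:
            source: data:,Authorized%20access%20only
```

      Applying such a MachineConfig causes the MCO to drain and reboot each node in turn, which is the window in which the fencing was observed.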

      Actual results:

      * Pre-Reboot: Survivor node stops receiving heartbeats for >3s.
      * Reboot: Peer node is hard-reset via Redfish.
      * Post-Reboot: The cluster stays in a "Stopped" state. Pacemaker logs show: error: local revision is older and peer is not starting.
      * Manual Fix: Requires pcs resource cleanup etcd-clone.

      Expected results:

      * Resilience: The system should tolerate minor latencies (e.g., 5-10 seconds) during maintenance operations.
      * Auto-Recovery: If a fencing event occurs, the cluster must be able to re-establish quorum automatically once the fenced node is back online without requiring a manual pcs resource cleanup.

      Additional info:

          Workaround: Manually cleaning up the pcs resources (pcs resource cleanup etcd-clone) recovered the TNF cluster.
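      The manual recovery applied here can be sketched as follows (run on one of the nodes; the status checks are included for context):

```shell
# Inspect the cluster and the failed actions recorded for etcd.
sudo pcs status

# Clear the failure history for the etcd clone resource so Pacemaker
# retries starting etcd on both nodes.
sudo pcs resource cleanup etcd-clone

# Confirm the resources restarted and quorum was re-established.
sudo pcs status resources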
      
      Error messages:
      
      * On worker-00: The error "local revision is older and peer is not starting" indicates that worker-00 recognizes its data is not the most up-to-date and is waiting for worker-01 to start. However, since worker-01 has also failed, the cluster is unable to proceed.
      * On worker-01: The error "podman failed to launch container (error code: 125)" typically signifies a conflict with an existing container, a file locking issue, or that the Podman storage is in an inconsistent state following the reboot.
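      For the exit-code-125 case on worker-01, a hedged diagnostic sketch (the container name "etcd" is an assumption based on the resource name above):

```shell
# List leftover etcd containers that might conflict with a new launch.
sudo podman ps -a --filter name=etcd

# Remove a stale container left over from before the reboot so the
# resource agent can recreate it (destructive; illustrative only).
sudo podman rm -f etcd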

              Vincenzo Mauro (rh-ee-vmauro)
              Alberto Losada (alosadag@redhat.com)
              Douglas Hensel