Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.20
Component/s: Two Node Fencing
Labels:
- ocpedge

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

During fencing validation, after a fencing event, etcdctl member list run from the peer (master-0) reports one control-plane node (master-1) as started and as a learner. On master-1 itself, however, no etcd container is running.
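To spot this state quickly from the healthy peer, the member list can be parsed and filtered for started learners. The following is a minimal sketch, assuming the default comma-separated output of etcdctl member list (ID, status, name, peer URLs, client URLs, is-learner). The member IDs and addresses in SAMPLE are made up for illustration, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    member_id: str
    status: str
    name: str
    is_learner: bool


def parse_member_list(output: str) -> list[Member]:
    """Parse the default (simple) output of `etcdctl member list`."""
    members = []
    for line in output.strip().splitlines():
        # Fields are separated by ", "; the learner flag is the last field.
        fields = [f.strip() for f in line.split(", ")]
        if len(fields) < 6:
            continue  # not a member row
        members.append(
            Member(fields[0], fields[1], fields[2], fields[-1] == "true")
        )
    return members


def started_learners(members: list[Member]) -> list[str]:
    """Names of members reported as started while still being learners."""
    return [m.name for m in members if m.status == "started" and m.is_learner]


# Hypothetical sample output; IDs and addresses are illustrative only.
SAMPLE = """\
1a2b3c4d5e6f7081, started, master-0, https://192.0.2.10:2380, https://192.0.2.10:2379, false
9f8e7d6c5b4a3920, started, master-1, https://192.0.2.11:2380, https://192.0.2.11:2379, true
"""

print(started_learners(parse_member_list(SAMPLE)))  # ['master-1']
```

A node listed here still needs to be checked locally (for example with podman ps on that node) to confirm whether an etcd container is actually running.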
Pacemaker logs from both nodes show the following fencing sequence:

1. master-0 is fenced and reboots.
2. master-1 experiences a failure and restarts as well.
3. On restart, both nodes compare etcd revisions.
4. The node with the newer revision (force-new-cluster) wins; the other becomes a learner.

In this instance, both agents restarted almost simultaneously after fencing. The result is a split-brain-like state in which master-1 is marked as a learner but has no etcd container running.
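The comparison rule described in the sequence above can be written down as a small function. This is only a toy model of the rule as stated in this report, not the podman-etcd agent's actual code; the function name and the revision numbers are hypothetical, and tie handling is left open because the report does not describe it.

```python
def role_after_restart(local_revision: int, peer_revision: int) -> str:
    """Role the local node would take under the rule described in this report."""
    if local_revision > peer_revision:
        return "force-new-cluster"
    if local_revision < peer_revision:
        return "learner"
    # Equal revisions: behavior is not specified in this report.
    return "undetermined"


# Hypothetical revisions, for illustration only.
print(role_after_restart(1200, 1180))  # force-new-cluster
print(role_after_restart(1180, 1200))  # learner
```

Because each node evaluates this rule from the revisions it sees at restart time, the outcome depends on both values being current when the comparison runs, which is the window the near-simultaneous restart appears to hit.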

How reproducible: 70%

Steps to Reproduce:

1. Deploy a TNF (two-node fencing) setup.
2. Trigger fencing of one node, for example with pcs stonith fence <node>.
3. Observe the Pacemaker and podman-etcd logs on both nodes.
4. In some runs, both nodes restart nearly simultaneously and the revision comparison misbehaves.

Actual results:

The fenced node joins as a learner, and its etcd container is not started.

Expected results:

The fenced node restarts after the leader, or in a controlled sequence, so that its etcd container starts and the node joins as a full member.

Additional info:

Suggested Fix:
Per Carlo’s investigation, a quick mitigation could be:

Have the node with the force-new-cluster attribute push a higher revision to the CIB upon restart to ensure it wins the comparison.
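The idea behind this mitigation can be sketched as follows. This is an illustration under stated assumptions, not the proposed patch: the CIB is modeled as a plain dictionary of published revisions, revision_to_publish is a hypothetical helper, and which node holds the force-new-cluster attribute (master-0 here) is chosen arbitrarily for the example.

```python
def revision_to_publish(
    own_revision: int, peer_revision: int, has_force_new_cluster: bool
) -> int:
    """Revision a node would publish on restart (hypothetical helper).

    A node holding the force-new-cluster attribute publishes a value
    strictly above the peer's so that it wins the comparison.
    """
    if has_force_new_cluster and own_revision <= peer_revision:
        return peer_revision + 1
    return own_revision


# Stand-in for the revisions stored in the CIB; values are illustrative.
cib = {"master-0": 1180, "master-1": 1200}

# master-0 is assumed to hold the force-new-cluster attribute.
cib["master-0"] = revision_to_publish(cib["master-0"], cib["master-1"], True)

winner = max(cib, key=cib.get)
print(winner, cib)  # master-0 {'master-0': 1201, 'master-1': 1200}
```

With the bump applied, the comparison has a deterministic winner even when both nodes restart at nearly the same time. A node without the attribute publishes its revision unchanged.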

is cloned by

OCPBUGS-65540 pcs debug-start can fail due to node already being marked as a learner

Assignee: Carlo Lobrano

Reporter: Neil Hamza

Need Info From: None

Contributors: None

QA Contact: Douglas Hensel

Doc Contact: None

Votes: 0

Watchers: 3

Created: 2025/09/02 12:15 PM

Updated: 2025/11/12 8:27 PM
