Type: Bug
Resolution: Unresolved
Priority: Critical
Affects Version/s: 4.20.z, 4.21.z, 4.22
Status: Proposed
Description of problem:
It is possible for podman-etcd to attempt recovery in a state where the etcd revision on a learner is higher than the revision on a voter. This seems especially prevalent in environments with slow disks, where the voter is starved for I/O bandwidth, times out, and is then fenced.
Version-Release number of selected component (if applicable):
Current podman-etcd in 9.6 and 9.8
How reproducible:
The easiest way to reproduce is to spoof the revision numbers and then fence the voter.
Steps to Reproduce:
1. Start one of your nodes as a learner
2. Spoof the learner's revision so it is higher than the voting member's
3. Fence the voting member node
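The deadlocked state produced by the steps above can be illustrated with a small model (node names and revision numbers are made up for illustration; this is not the actual agent code). A naive "recover from the highest revision" rule selects the learner, which cannot serve on its own:

```python
# Minimal model of the reproduction state (illustrative names/values only).
nodes = [
    {"name": "voter-1",   "is_learner": False, "revision": 1000},  # fenced voter
    {"name": "learner-1", "is_learner": True,  "revision": 1500},  # spoofed higher revision
]

# Naive rule: recover from whichever node has the highest revision.
best = max(nodes, key=lambda n: n["revision"])
print(best["name"])  # learner-1

# The voter waits for the higher-revision node to come up first, but a
# learner cannot form a quorum by itself, so neither node makes progress.
can_serve = not best["is_learner"]
print(can_serve)  # False -> recovery deadlock
```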
Actual results:
Recovery deadlocks: no node can recover, because the voter waits for the higher-revision node (the learner) to start first, while the learner crashes because it has no voting peers.
Expected results:
Learners are not full members, so their revision numbers should not matter. When deciding which node to start from, only the revisions of voting members should be compared.
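The fix could be sketched as follows. This is an illustrative assumption, not the actual podman-etcd agent code: `pick_start_node` is a hypothetical helper, and the input shapes mirror `etcdctl member list -w json` (where `isLearner` is omitted when false) and revisions taken from `etcdctl endpoint status`.

```python
def pick_start_node(member_list, revisions):
    """Pick the node to recover from: highest revision among *voting* members.

    member_list: dict shaped like `etcdctl member list -w json` output
    revisions:   mapping of member name -> etcd revision
    """
    # protobuf-style JSON omits false fields, so a missing "isLearner" means voter.
    voters = {m["name"] for m in member_list["members"]
              if not m.get("isLearner", False)}
    candidates = {name: rev for name, rev in revisions.items() if name in voters}
    if not candidates:
        raise RuntimeError("no voting member available to recover from")
    return max(candidates, key=candidates.get)

members = {"members": [
    {"ID": 1, "name": "node-a"},                     # voter
    {"ID": 2, "name": "node-b", "isLearner": True},  # learner
]}
revisions = {"node-a": 1000, "node-b": 1500}  # learner ahead, as in this bug

print(pick_start_node(members, revisions))  # node-a: the learner's higher revision is ignored
```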
Additional info:
We observed a similar case where the revision numbers were equal, so the cluster was able to start, but one of the nodes (the learner) crashed immediately, and Pacemaker was left believing that node was healthy even though etcd was not running there. I would have expected the monitor code to catch this, but I did not watch long enough to confirm whether it did.