-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
4.20
-
Quality / Stability / Reliability
-
False
-
Rejected
Description of problem:
During fencing validation, after a fencing event, one control-plane node (master-1) is reported as started + learner in the etcdctl member list output when checked from its peer (master-0). On master-1 itself, however, no etcd container is running.

Pacemaker logs from both nodes show the following fencing sequence:
1. master-0 is fenced and reboots.
2. master-1 experiences a failure and restarts as well.
3. On restart, both nodes compare etcd revisions. The node with the newer revision wins (force-new-cluster); the other becomes a learner.

In this instance, both agents restarted almost simultaneously after fencing, which led to a split-brain-like situation in which master-1 ends up marked as a learner with no etcd container running.
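The revision comparison described above can be modelled in a few lines. This is a minimal sketch and not the podman-etcd agent's actual code: the function name decide_role, the role strings, and the stale-read scenario are assumptions made for illustration, and the revision numbers are invented.

```python
from typing import Optional


def decide_role(local_rev: int, peer_rev: Optional[int]) -> str:
    """Hypothetical model of the restart-time comparison (not the agent's code)."""
    if peer_rev is None:
        return "wait"  # peer has not published a revision yet
    if local_rev > peer_rev:
        return "force-new-cluster"
    if local_rev < peer_rev:
        return "learner"
    return "tie"  # equal revisions need a separate tie-breaker


# Orderly restart: each node sees the other's current revision.
orderly = (decide_role(120, 118), decide_role(118, 120))

# Near-simultaneous restart (assumed race): the second node reads a stale
# value (100) that its peer published before the fencing event.
racy = (decide_role(120, 118), decide_role(118, 100))

print("orderly:", orderly)
print("racy:   ", racy)
```

In the orderly case the two decisions are complementary. In the assumed race they are not, which is one way the comparison could misbehave; the logs in this ticket do not establish that this is the exact mechanism.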
How reproducible: 70%
Steps to Reproduce:
1. Deploy a TNF setup.
2. Trigger fencing of one node (pcs stonith fence <node>).
3. Observe the Pacemaker and podman-etcd logs from both nodes.
In some cases, both nodes restart nearly simultaneously, causing the revision comparison to misbehave.
Actual results:
The fenced node joins as a learner, but its etcd container is not started.
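The mismatch between the member list and the container state could be checked with a small script such as the following. It is a sketch under stated assumptions: it expects the six-column, comma-separated layout that etcdctl member list prints by default in recent versions (ID, status, name, peer URLs, client URLs, is-learner), and it takes the per-node container state as an input that would have to be collected separately, for example from podman ps on each node. The member IDs and URLs in the sample are invented.

```python
from typing import Dict, List


def learners_without_container(member_list: str, running: Dict[str, bool]) -> List[str]:
    """Return members listed as started learners whose etcd container is not running."""
    stuck = []
    for line in member_list.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 6:
            continue  # skip lines that do not match the assumed layout
        _id, status, name, _peer_urls, _client_urls, is_learner = fields
        if status == "started" and is_learner == "true" and not running.get(name, False):
            stuck.append(name)
    return stuck


# Illustrative sample only: IDs and URLs are made up.
sample = """
1111111111111111, started, master-0, https://192.0.2.10:2380, https://192.0.2.10:2379, false
2222222222222222, started, master-1, https://192.0.2.11:2380, https://192.0.2.11:2379, true
"""

# Container state per node, gathered out of band (assumption).
containers = {"master-0": True, "master-1": False}

print(learners_without_container(sample, containers))
```

A non-empty result corresponds to the state reported here: a member that the peer lists as a started learner while no etcd container runs on that node.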
Expected results:
The fenced node restarts after the leader, or in a controlled sequence, so that its etcd container starts and the node joins as a full member.
Additional info:
Suggested fix: Per Carlo's investigation, a quick mitigation could be to have the node with the force-new-cluster attribute push a higher revision to the CIB on restart, so that it wins the comparison.
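The suggested mitigation can be sketched as follows. This is an illustration only: a plain dictionary stands in for the CIB, and the names revision_to_publish and winner are hypothetical rather than part of Pacemaker or the podman-etcd agent. The revision numbers are invented.

```python
from typing import Dict


def revision_to_publish(local_rev: int, peer_rev: int, force_new_cluster: bool) -> int:
    """Pick the revision a node publishes on restart (hypothetical helper)."""
    if force_new_cluster:
        # Publish strictly more than anything the peer has published,
        # so this node wins the comparison regardless of restart timing.
        return max(local_rev, peer_rev) + 1
    return local_rev


def winner(published: Dict[str, int]) -> str:
    """Node with the highest published revision."""
    return max(published, key=published.get)


# Dictionary stand-in for the CIB attributes (assumption).
cib = {"master-0": 118, "master-1": 120}

# master-0 carries the force-new-cluster attribute but has the lower revision.
cib["master-0"] = revision_to_publish(118, cib["master-1"], force_new_cluster=True)

print(cib, "winner:", winner(cib))
```

With the bump applied, the node holding the force-new-cluster attribute wins even though its own revision was lower. Whether this is safe in every restart ordering would still need to be validated against the real agent.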