Bug | Resolution: Unresolved | Major | 4.19.0, 4.20 | Quality / Stability / Reliability | OCPEDGE Sprint 278, OCPEDGE Sprint 279, OCPEDGE Sprint 280 | In Progress | Bug Fix
Description of problem:
A rapid restart of podman-etcd fails, probably due to a misalignment of the clone notification environment variables[1] that count the number of active and inactive agents. This also has the effect of stalling cluster recovery.
Jul 11 09:26:59 master-0 pacemaker-controld[1885]: notice: Result of stop operation for etcd on master-0: ok
Jul 11 09:26:59 master-0 pacemaker-controld[1885]: notice: Requesting local execution of start operation for etcd on master-0
Jul 11 09:27:00 master-0 podman-etcd(etcd)[9729]: NOTICE: podman-etcd start
Jul 11 09:27:00 master-0 podman-etcd(etcd)[9762]: INFO: ensure etcd pod is not running (retries: 60, interval: 10)
Jul 11 09:27:00 master-0 podman-etcd(etcd)[9896]: ERROR: Unexpected active resource count: 2
Jul 11 09:27:00 master-0 pacemaker-controld[1885]: notice: Result of start operation for etcd on master-0: error
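For background, the clone notification environment variables mentioned above are the OCF_RESKEY_CRM_meta_notify_* node lists that Pacemaker exports to cloned resource agents. The following is a minimal sketch, not the actual podman-etcd code, of how an agent can derive an active-instance count from those lists and abort a start when the count is unexpected, which is the kind of check behind the "Unexpected active resource count: 2" error:

#!/bin/bash
# Illustrative sketch only -- not the real podman-etcd agent.
# Pacemaker exports whitespace-separated node lists for clones with
# notifications enabled, e.g.:
#   OCF_RESKEY_CRM_meta_notify_active_uname  - nodes where the clone is active
#   OCF_RESKEY_CRM_meta_notify_stop_uname    - nodes where it is being stopped

count_entries() {
    # Count whitespace-separated entries (0 if unset or empty).
    echo "$1" | wc -w
}

active_count=$(count_entries "${OCF_RESKEY_CRM_meta_notify_active_uname}")

# On a two-node cluster, a node starting after its container was killed
# expects at most one other active instance. If the notification variables
# are stale after a rapid stop/start cycle, the count can come out wrong
# (e.g. 2) and the start is aborted.
if [ "${active_count}" -gt 1 ]; then
    echo "ERROR: Unexpected active resource count: ${active_count}" >&2
    exit 1  # OCF_ERR_GENERIC
fi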
Steps to reproduce
In a stable Two Nodes with Fencing cluster (any version; the problem is in the underlying RHCOS packages) that has an etcd cluster with 2 members, suddenly killing one of the members triggers this bug.
Running "sudo podman kill etcd" on one of the nodes is enough, but any way of stopping the container ungracefully will suffice. The log messages from the description above will then appear in the pacemaker logs (journalctl -u pacemaker).
When checking with "sudo pcs status", the resource will look like this:
* Clone Set: etcd-clone [etcd]:
* Started: [ javier-master-0-1 ]
* Stopped: [ javier-master-0-0 ]
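For convenience, the reproduction steps above can be condensed into a few commands (run on one of the nodes; the grep patterns are taken from the log excerpt in the description):

# Ungracefully kill the etcd container managed by the clone
sudo podman kill etcd

# Follow the pacemaker journal for the failed restart attempt
sudo journalctl -u pacemaker -f | grep -E 'podman-etcd|Result of (stop|start) operation for etcd'

# Confirm the clone state: the killed node remains Stopped
sudo pcs status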
Verification information
The bugfix is merged in https://github.com/ClusterLabs/resource-agents/pull/2082.
NOTE: Even with this bugfix, the etcd cluster might still fail to recover. There is a second, previously hidden bug fixed in https://github.com/ClusterLabs/resource-agents/pull/2089. If the system is verified with the first fix but not the second, the problem described here should not occur, but the cluster might still be unable to start properly.
With both fixes applied, the cluster should recover properly. This can be verified by running "sudo pcs status" and checking that the etcd clone looks like this:
* Clone Set: etcd-clone [etcd]:
* Started: [ javier-master-0-0 javier-master-0-1 ]
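A sketch of a verification run with both fixes applied (node names are the examples from this report; the package name and the agent path under /usr/lib/ocf/resource.d/heartbeat/ are assumptions about how the fixed agent is delivered):

# Check which resource-agents build is installed (it must contain both fixes)
rpm -q resource-agents
ls /usr/lib/ocf/resource.d/heartbeat/podman-etcd   # assumed install path

# Re-run the reproducer and give pacemaker time to restart the instance
sudo podman kill etcd
sleep 60

# Both nodes should show up as Started in the clone
sudo pcs status | grep -A 1 'Clone Set: etcd-clone'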