Bug | Resolution: Unresolved | Major | 4.19.0, 4.20 | Quality / Stability / Reliability | OCPEDGE Sprint 278, OCPEDGE Sprint 279, OCPEDGE Sprint 280 | In Progress | Bug Fix
Description of problem:
A rapid restart of podman-etcd fails, probably due to a misalignment of the clone notification environment variables[1] that count the number of active and inactive agents. This also has the effect of stalling cluster recovery.
Jul 11 09:26:59 master-0 pacemaker-controld[1885]: notice: Result of stop operation for etcd on master-0: ok
Jul 11 09:26:59 master-0 pacemaker-controld[1885]: notice: Requesting local execution of start operation for etcd on master-0
Jul 11 09:27:00 master-0 podman-etcd(etcd)[9729]: NOTICE: podman-etcd start
Jul 11 09:27:00 master-0 podman-etcd(etcd)[9762]: INFO: ensure etcd pod is not running (retries: 60, interval: 10)
Jul 11 09:27:00 master-0 podman-etcd(etcd)[9896]: ERROR: Unexpected active resource count: 2
Jul 11 09:27:00 master-0 pacemaker-controld[1885]: notice: Result of start operation for etcd on master-0: error
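For background, the clone notification environment variables mentioned above are the OCF_RESKEY_CRM_meta_notify_* node lists that Pacemaker exports to cloned resource agents. The following is a minimal sketch, not the actual podman-etcd code, of how an agent can derive an active-instance count from those lists and abort a start when the count is unexpected, which is the kind of check behind the "Unexpected active resource count: 2" error:

#!/bin/bash
# Illustrative sketch only -- not the real podman-etcd agent.
# Pacemaker exports whitespace-separated node lists for clones with
# notifications enabled, e.g.:
#   OCF_RESKEY_CRM_meta_notify_active_uname  - nodes where the clone is active
#   OCF_RESKEY_CRM_meta_notify_stop_uname    - nodes where it is being stopped

count_entries() {
    # Count whitespace-separated entries (0 if unset or empty).
    echo "$1" | wc -w
}

active_count=$(count_entries "${OCF_RESKEY_CRM_meta_notify_active_uname}")

# On a two-node cluster, a node starting after its container was killed
# expects at most one other active instance. If the notification variables
# are stale after a rapid stop/start cycle, the count can come out wrong
# (e.g. 2) and the start is aborted.
if [ "${active_count}" -gt 1 ]; then
    echo "ERROR: Unexpected active resource count: ${active_count}" >&2
    exit 1  # OCF_ERR_GENERIC
fi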
Steps to reproduce
In a stable Two Nodes with Fencing cluster (any version; the problem is in the underlying RHCOS packages) that has an etcd cluster with 2 members, suddenly killing one of the members triggers this bug.
Running "sudo podman kill etcd" on one of the nodes is enough, but any way of stopping the container ungracefully will suffice. The log messages from the description above will then appear in the pacemaker logs (journalctl -u pacemaker).
When checking with "sudo pcs status", the resource will look like this:
* Clone Set: etcd-clone [etcd]:
* Started: [ javier-master-0-1 ]
* Stopped: [ javier-master-0-0 ]
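For convenience, the reproduction steps above can be condensed into a few commands (run on one of the nodes; the grep patterns are taken from the log excerpt in the description):

# Ungracefully kill the etcd container managed by the clone
sudo podman kill etcd

# Follow the pacemaker journal for the failed restart attempt
sudo journalctl -u pacemaker -f | grep -E 'podman-etcd|Result of (stop|start) operation for etcd'

# Confirm the clone state: the killed node remains Stopped
sudo pcs status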
Verification information
The bugfix is merged in https://github.com/ClusterLabs/resource-agents/pull/2082.
NOTE: Even with this bugfix, the etcd cluster might still fail to recover. There is a second, previously hidden bug fixed in https://github.com/ClusterLabs/resource-agents/pull/2089. If the system is verified with the first fix but not the second, the problem described here should not occur, but the cluster might still be unable to start properly.
With both fixes applied, the cluster should recover properly. This can be verified by running "sudo pcs status" and checking that the etcd clone looks like this:
* Clone Set: etcd-clone [etcd]:
* Started: [ javier-master-0-0 javier-master-0-1 ]
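A sketch of a verification run with both fixes applied (node names are the examples from this report; the package name and the agent path under /usr/lib/ocf/resource.d/heartbeat/ are assumptions about how the fixed agent is delivered):

# Check which resource-agents build is installed (it must contain both fixes)
rpm -q resource-agents
ls /usr/lib/ocf/resource.d/heartbeat/podman-etcd   # assumed install path

# Re-run the reproducer and give pacemaker time to restart the instance
sudo podman kill etcd
sleep 60

# Both nodes should show up as Started in the clone
sudo pcs status | grep -A 1 'Clone Set: etcd-clone'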