-
Bug
-
Resolution: Unresolved
-
Normal
-
4.20, 4.21, 4.22
-
False
-
-
0
-
Moderate
-
OCPEDGE Sprint 281, OCPEDGE Sprint 282
-
2
Description of problem:
During TNF (Two Nodes with Fencing) installation via the MCE Assisted Installer, the Pacemaker etcd resource fails to start on one node because the /var/lib/etcd directory does not exist yet. The cause is a race condition: the TNF setup job creates the Pacemaker etcd resource before the CEO (cluster-etcd-operator) installer pod has created the etcd data directory on the other node, which is the former bootstrap node.
A more detailed analysis can be found here:
Manifests here: attached manifests
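The ordering problem described above can be illustrated with a minimal sketch. It is not the actual TNF setup job code and not a proposed fix; it only shows the general idea of gating resource creation on the data directory being present. The function names `wait_for_dir` and `create_etcd_resource_when_ready`, the `create_resource` callback, and the timeout values are all hypothetical.

```python
import os
import time


def wait_for_dir(path: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll until `path` exists as a directory, or give up after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.isdir(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def create_etcd_resource_when_ready(
    data_dir: str,
    create_resource,  # hypothetical callback that would create the Pacemaker etcd resource
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> bool:
    """Invoke `create_resource` only after `data_dir` exists; return False on timeout."""
    if not wait_for_dir(data_dir, timeout_s, interval_s):
        return False
    create_resource()
    return True
```

In this sketch, `data_dir` would be `/var/lib/etcd` on the node in question. Without such a check, the resource can be created while the directory is still missing, which matches the failure reported here.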
Version-Release number of selected component (if applicable):
4.20+
How reproducible:
95% of installation attempts
Steps to Reproduce:
1. Run an MCE Assisted Installer installation with the attached manifests (they include the fix for ClusterLabs/resource-agents).
2. Wait for the bootstrap node to restart.
3. Pacemaker takes over the etcd cluster with only one node.
4. The installation never finishes.
Actual results:
The installation gets stuck because CEO never sees the etcd cluster become fully stable with two nodes.
Expected results:
The installation finishes successfully.
Additional info:
This bug can only be reproduced when the fix for OCPBUGS-64765 is applied. That fix is included in the attached manifests.
- depends on: OCPBUGS-64765 TNF assisted-service installation stuck due to address/port conflict (ON_QA)