Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.20
Component/s: Two Node Fencing
Labels:
- qe-core

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
0
Severity:
Critical
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
OCPEDGE Sprint 280, OCPEDGE Sprint 282, OCPEDGE Sprint 283, OCPEDGE Sprint 284
sprint_count:
4

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

*Cause*: Detection of the static etcd pod being stopped was inconsistent.
*Consequence*: A conflict occurred on startup of the podman etcd container, resulting in a failed installation.
*Fix*: Improved detection of a not-fully-running etcd static pod in the podman-etcd resource agent.
*Result*: podman etcd starts only when the static etcd pod is really stopped.
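
The idea behind the fix (start podman etcd only once the static pod is reliably stopped) can be illustrated with a small polling sketch. This is not the podman-etcd resource agent's actual logic: the function name `wait_until_stopped`, the `is_running` probe, and the "several consecutive negative probes" rule are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_stopped(
    is_running: Callable[[], bool],
    required_consecutive: int = 3,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once `is_running` reports False several times in a row.

    `is_running` is a hypothetical probe supplied by the caller (for example,
    a check for a running static etcd container). A single negative probe is
    not trusted, because the issue describes detection as inconsistent.
    """
    deadline = clock() + timeout_s
    consecutive = 0
    while True:
        if is_running():
            consecutive = 0  # the pod reappeared, so restart the count
        else:
            consecutive += 1
            if consecutive >= required_consecutive:
                return True
        if clock() >= deadline:
            return False  # never observed a stable "stopped" state in time
        sleep(interval_s)
```

A caller would start the podman etcd container only if this returns True, and report a failure otherwise.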


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

    The installation of a TNF (Two Node Fencing) cluster using agent-based installation gets stuck because pacemaker fails to synchronize entities.
The journalctl output shows failures related to a missing revision.json file.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Install a cluster to act as the Hub cluster.
2. Install MCE.
3. Provision the infrastructure for a new spoke cluster.
4. Apply the manifests that deploy a TNF cluster (javier-1_manifests.tgz).
5. After the nodes are installed, the ACI status is "finalizing", but pcs status on the hosts shows failed resource actions (see Actual results).

Actual results:

    pcs status:

Full List of Resources:
  * Clone Set: kubelet-clone [kubelet]:
    * Started: [ javier-master-1-0 javier-master-1-1 ]
  * javier-master-1-0_redfish    (stonith:fence_redfish):     Started javier-master-1-0
  * javier-master-1-1_redfish    (stonith:fence_redfish):     Started javier-master-1-1
  * Clone Set: etcd-clone [etcd]:
    * Stopped: [ javier-master-1-0 javier-master-1-1 ]

Failed Resource Actions:
  * etcd start on javier-master-1-1 returned 'error' (podman failed to launch container (error code: 1)) at Thu Nov  6 15:48:25 2025 after 2m6.080s
  * etcd start on javier-master-1-0 could not be executed (Timed Out: Resource agent did not complete within 10m) at Thu Nov  6 15:48:25 2025 after 10m2ms
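
To check for this symptom from a script, the "Failed Resource Actions" section of the output above can be parsed. The following is a minimal sketch that assumes the line layout shown in this report. The helper name `failed_resource_actions` and the regular expression are illustrative and have not been validated against other pcs versions.

```python
import re
from typing import Dict, List

# Matches entries such as:
#   * etcd start on javier-master-1-1 returned 'error' (...) at ...
#   * etcd start on javier-master-1-0 could not be executed (...) at ...
_FAILED_ACTION = re.compile(
    r"^\s*\*\s+(?P<resource>\S+)\s+(?P<action>\S+)\s+on\s+(?P<node>\S+)\s+(?P<detail>.+)$"
)


def failed_resource_actions(pcs_status_text: str) -> List[Dict[str, str]]:
    """Extract resource, action, node and detail from each failed action."""
    failures: List[Dict[str, str]] = []
    in_section = False
    for line in pcs_status_text.splitlines():
        # Substring match, so a header fused onto the previous line still counts.
        if "Failed Resource Actions:" in line:
            in_section = True
            continue
        if not in_section:
            continue
        if not line.strip():
            if failures:
                break  # a blank line after the entries ends the section
            continue
        match = _FAILED_ACTION.match(line)
        if match:
            failures.append(match.groupdict())
    return failures
```

Applied to the output above, this yields two entries, both for the `etcd` resource and the `start` action, one per node.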

Expected results:

    pcs status shows no failed resources, and the TNF cluster is deployed successfully.

Additional info:

As a manual workaround, run "sudo pcs resource cleanup" on a node to continue the installation.
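
If the workaround has to be applied from automation, a thin wrapper along these lines may help. It is a sketch only: `cleanup_command` and `run_cleanup` are hypothetical helper names, and the optional resource ID argument (standard `pcs resource cleanup` syntax) was not exercised in this issue, where the command was run without arguments.

```python
import subprocess
from typing import Callable, List, Optional


def cleanup_command(resource: Optional[str] = None) -> List[str]:
    """Build the workaround command, optionally scoped to one resource ID."""
    cmd = ["sudo", "pcs", "resource", "cleanup"]
    if resource:
        cmd.append(resource)
    return cmd


def run_cleanup(
    resource: Optional[str] = None,
    runner: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> int:
    """Run the cleanup on the local node and return the command's exit code.

    `runner` is injectable so the wrapper can be exercised without pcs installed.
    """
    completed = runner(cleanup_command(resource), check=False)
    return completed.returncode
```

Calling `run_cleanup()` on one of the cluster nodes is equivalent to the manual command above.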

is depended on by

OCPBUGS-68371 Race condition in CEO stops TNF assisted-service installation

Verified

Assignee: Pablo Fontanilla

Reporter: Francisco Javier Moreno

Need Info From: None

Contributors: None

QA Contact: Gal Amado

Doc Contact: None

Votes: 0

Watchers: 9

Created: 2025/11/06 9:37 PM

Updated: 2026/02/18 8:43 AM
