Data Foundation Bugs / DFBUGS-2363

[Tracker][HCI only] RDR: After CephFS app enrollment for RDR, all three compute nodes on secondary site ag5 go into NotReady state one by one. Before app enrollment, all nodes were in Ready state (node reboot)

    • https://issues.redhat.com/browse/RHEL-92887
    • 4.19.1-1.konflux
    • Committed
    • Release Note Not Required
    • Critical
    • Approved

       

      Description of problem: RDR: After CephFS app enrollment for RDR, all three compute nodes on the secondary site ag5 go into NotReady state one by one. Before app enrollment, all nodes were in Ready state.

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      6-node OCP 4.18 HCI cluster

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Provider RDR

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      OCP 4.18
      ODF 4.18.2
      ACM 2.13
      Seen on racks: rackm03, rackm14, ag5
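      A minimal sketch of how these component versions are typically read back from the clusters (hedged; standard OpenShift/OLM queries, not necessarily the exact commands used here):

      # Hedged sketch: standard queries for the component versions listed above.
      oc get clusterversion                      # OCP version
      oc get csv -n openshift-storage            # ODF operator CSVs and versions
      oc get subscription -n openshift-storage   # subscribed channels/versions
      oc get multiclusterhub -A                  # ACM hub operator (hub cluster); version is in its status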
       

      Does this issue impact your ability to continue to work with the product?

      Yes

       

      Is there any workaround available to the best of your knowledge?

      No

       

      Can this issue be reproduced? If so, please provide the hit rate

      Yes

       

      Can this issue be reproduced from the UI?

      Yes

      If this is a regression, please provide more details to justify this:

       

      Steps to Reproduce:

      1. Deployed apps on the ag4 cluster (cephfs-tarun-1) via GitOps, using the CephFS storage class.

      2. Enrolled the above app for DR (see the sketch after this list).

      3. After the CephFS app enrollment for RDR, a node on the secondary site ag5 went into NotReady state. Before app enrollment, all nodes were in Ready state.

      4. Initially only one compute node was in NotReady state; later another compute node also went into NotReady.

      The same was also observed today for RBD:
      1. Enrolled a discovered app on ag4.
      2. Failed over to site2 (ag5).
      3. Observed that on the primary site ag4, one node went into NotReady state.
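      For reference, a minimal sketch of how the DR enrollment can be checked and a failover triggered from the hub cluster CLI. This is hedged and assumes Ramen-style DRPlacementControl objects; the namespace and DRPC name below are hypothetical placeholders, not the exact objects from this report.

      # Verify the app is enrolled for DR: a DRPlacementControl (drpc) should exist.
      oc get drpc -n cephfs-tarun-1 -o wide

      # Trigger a failover of the DR-protected workload to the secondary cluster (ag5).
      # (Hypothetical DRPC name; spec.action/spec.failoverCluster per the Ramen DRPC API.)
      oc patch drpc cephfs-tarun-1-placement-drpc -n cephfs-tarun-1 --type merge \
        -p '{"spec":{"action":"Failover","failoverCluster":"ag5"}}'

      # Watch failover progress.
      oc get drpc -n cephfs-tarun-1 -o wide -w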

       
       
       

       

      NAME                                 STATUS     ROLES                  AGE   VERSION
      compute-1-ru5.rackag4.mydomain.com   Ready      worker                 40d   v1.31.5
      compute-1-ru6.rackag4.mydomain.com   Ready      worker                 40d   v1.31.5
      compute-1-ru7.rackag4.mydomain.com   NotReady   worker                 40d   v1.31.5
      control-1-ru2.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
      control-1-ru3.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
      control-1-ru4.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
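      The timestamped snapshots below (secondary site ag5) were captured by listing the nodes in a loop. A minimal sketch of that kind of loop follows; the 5-second interval is an assumption inferred from the timestamps, not a confirmed detail.

      # Hedged sketch: periodically list nodes with a timestamp banner,
      # producing output like the snapshots below.
      while true; do
        echo "----- $(date) -----"
        oc get nodes
        sleep 5
      done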

       

      ----- Sat Apr 26 03:03:55 EDT 2025 -----
      NAME                                 STATUS     ROLES                  AGE   VERSION
      compute-1-ru5.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
      compute-1-ru6.rackag5.mydomain.com   NotReady   worker                 16d   v1.31.6
      compute-1-ru7.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
      control-1-ru2.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru3.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru4.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6

      ----- Sat Apr 26 03:04:00 EDT 2025 -----
      NAME                                 STATUS     ROLES                  AGE   VERSION
      compute-1-ru5.rackag5.mydomain.com   NotReady   worker                 17d   v1.31.6
      compute-1-ru6.rackag5.mydomain.com   NotReady   worker                 16d   v1.31.6
      compute-1-ru7.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
      control-1-ru2.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru3.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru4.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6

       

      Later, node compute-1-ru7.rackag5.mydomain.com was also observed going into NotReady state:

      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
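      To see why a node is reported NotReady (typically the kubelet has stopped posting status), its conditions can be inspected. A hedged sketch, using a node name from this report:

      # Node conditions as reported by the API server.
      oc describe node compute-1-ru7.rackag5.mydomain.com | grep -A 10 'Conditions:'

      # The same conditions in machine-readable form (type / status / reason).
      oc get node compute-1-ru7.rackag5.mydomain.com \
        -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'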

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:
      Nodes get rebooted, one after another, once a DR-protected CephFS workload has been running for some time.
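      Since the symptom is an unexpected node reboot, the reboot can be confirmed from the node's own journal once it is reachable again. A hedged sketch (debug-pod access and the node name are assumptions):

      # List recorded boots; an unexpected extra boot confirms the reboot.
      oc debug node/compute-1-ru7.rackag5.mydomain.com -- chroot /host journalctl --list-boots

      # Errors from the previous boot, to look for what triggered it.
      oc debug node/compute-1-ru7.rackag5.mydomain.com -- chroot /host journalctl -b -1 -p err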
       

       

      Expected results:
      Nodes should not get rebooted.
       

      Logs collected and log location:
      rackm03- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/amrita/bz/racks/rackm03/30-04-2025_15-07-58/30-04-2025_15-07-58-cl/
      rackm14- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/amrita/bz/racks/rackm14/30-04-2025_15-08-19/30-04-2025_15-08-19-api-rackm14-mydomain-com/

       

      Additional info:
      Slack discussion

       
       
