-
Bug
-
Resolution: Done
-
Critical
-
odf-4.18.2
-
True
-
-
False
-
Committed
-
?
-
?
-
4.19.1-1.konflux
-
Committed
-
Release Note Not Required
-
Critical
-
Approved
-
None
Description of problem - RDR: After a CephFS app is enrolled for RDR, all 3 compute nodes on the secondary site ag5 go into NotReady state one by one. Before app enrollment, all nodes were in Ready state.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
OCP HCI 6 node 4.18 cluster
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
Provider RDR
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
OCP 4.18
ODF 4.18.2
ACM 2.13
Seen on racks: rackm03, rackm14, ag5
Does this issue impact your ability to continue to work with the product?
Yes
Is there any workaround available to the best of your knowledge?
No
Can this issue be reproduced? If so, please provide the hit rate
Yes
Can this issue be reproduced from the UI?
Yes
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deployed apps on ag4 cluster cephfs-tarun-1 using GitOps with the CephFS storage class.
2. Enrolled the above app for DR.
3. After CephFS app enrollment for RDR, a node on the secondary site ag5 goes into NotReady state. Before app enrollment, all nodes were in Ready state.
4. Initially only 1 compute node was in NotReady state; later another compute node also went into NotReady.
Also observed for RBD today:
1. Enroll a discovered app for ag4.
2. Fail over to site2 (ag5).
3. Observed that one node on the primary site ag4 went into NotReady state.
NAME                                 STATUS     ROLES                  AGE   VERSION
compute-1-ru5.rackag4.mydomain.com   Ready      worker                 40d   v1.31.5
compute-1-ru6.rackag4.mydomain.com   Ready      worker                 40d   v1.31.5
compute-1-ru7.rackag4.mydomain.com   NotReady   worker                 40d   v1.31.5
control-1-ru2.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
control-1-ru3.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
control-1-ru4.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
----- Sat Apr 26 03:03:55 EDT 2025 -----
NAME                                 STATUS     ROLES                  AGE   VERSION
compute-1-ru5.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
compute-1-ru6.rackag5.mydomain.com   NotReady   worker                 16d   v1.31.6
compute-1-ru7.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
control-1-ru2.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
control-1-ru3.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
control-1-ru4.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
----- Sat Apr 26 03:04:00 EDT 2025 -----
NAME                                 STATUS     ROLES                  AGE   VERSION
compute-1-ru5.rackag5.mydomain.com   NotReady   worker                 17d   v1.31.6
compute-1-ru6.rackag5.mydomain.com   NotReady   worker                 16d   v1.31.6
compute-1-ru7.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
control-1-ru2.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
control-1-ru3.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
control-1-ru4.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
Later observed node compute-1-ru7.rackag5.mydomain.com also went into NotReady state.
compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
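To make it easier to see which nodes dropped out at each poll, the pasted node listings can be summarized with a small script. The following is only a minimal sketch, not part of the original reproduction: the function name `not_ready_by_snapshot` is hypothetical, and it assumes that each `----- <timestamp> -----` separator closes the block of rows above it, which the report does not state explicitly.

```python
import re

# Separator format assumed from the pasted output: "----- <timestamp> -----"
STAMP = re.compile(r"^-{5}\s+(.*?)\s+-{5}$")


def not_ready_by_snapshot(text):
    """Return a list of (timestamp, [NotReady node names]), one per snapshot.

    Assumption: a separator line ends the snapshot printed above it.
    Rows after the last separator are reported with timestamp None.
    """
    snapshots = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        match = STAMP.match(line)
        if match:
            snapshots.append((match.group(1), current))
            current = []
            continue
        fields = line.split()
        # Skip the header row and anything that is not a node row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        # STATUS can be compound, e.g. "NotReady,SchedulingDisabled".
        if "NotReady" in fields[1].split(","):
            current.append(fields[0])
    if current:
        snapshots.append((None, current))
    return snapshots
```

Feeding it the saved polling output (for example, the text pasted above) yields one entry per snapshot, so the order in which compute nodes went NotReady can be read off directly.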
The exact date and time when the issue was observed, including timezone details:
Actual results:
Nodes reboot simultaneously after a DR-protected CephFS workload has been running for some time.
Expected results:
Nodes should not get rebooted.
Logs collected and log location:
rackm03- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/amrita/bz/racks/rackm03/30-04-2025_15-07-58/30-04-2025_15-07-58-cl/
rackm14- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/amrita/bz/racks/rackm14/30-04-2025_15-08-19/30-04-2025_15-08-19-api-rackm14-mydomain-com/
Additional info:
Slack discussion