Data Foundation Bugs / DFBUGS-530

[2304182] [RDR] [Hub recovery] [Co-situated] Unable to resolve DRPC State when the backed-up state differs from the VRG state


      Description of problem (please be as detailed as possible and provide log snippets):

      This BZ is an extension of BZ2302144 (it is one of the issues observed while executing/filing BZ2302144) but is tracked separately here.

      Version of all relevant components (if applicable):

      Platform: VMware

      OCP 4.16.0-0.nightly-2024-07-29-013917
      ACM 2.11.1 GA'ed
      MCE 2.6.1
      OADP 1.4.0
      ODF 4.16 GA'ed
      Gitops 1.13.1
      ceph version 18.2.1-194.el9cp (04a992766839cd3207877e518a1238cdbac3787e) reef (stable)
      Submariner 0.18.0
      VolSync 0.9.2
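
      For completeness, a hedged sketch of how the component versions above can be collected; the namespace and toolbox deployment names are assumptions based on a default ODF/ACM install:

      # Cluster and operator versions (operator CSVs are listed per namespace)
      oc version
      oc get csv -n openshift-storage            # ODF operators
      oc get csv -n open-cluster-management      # ACM
      # Submariner CLI version, if subctl is installed on the workstation
      subctl version
      # Ceph version via the rook-ceph toolbox (only present if the toolbox is enabled)
      oc rsh -n openshift-storage deploy/rook-ceph-tools ceph version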

      Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an RDR setup with multiple workloads (rbd appset(pull)/sub, cephfs appset(pull)/sub, and imperative apps) in Deployed, FailedOver and Relocated states running on both managed clusters, configure it for hub recovery but do not start taking new backups.
      2. Before backups are taken, ensure the state above is achieved.
      3. Start taking backups, and once there are 1 or 2 successful backups, either stop the backup or increase the backup interval so that the actions in the following steps are performed without a new backup being taken (see the command sketch after this list).
      Collect must-gather and all other observations.
      4. Now move the workloads across the managed clusters and achieve the same state as in Step 1.
      Meaning, move the workloads which are primary on C1 to C2, and vice versa. Let the workloads in Deployed state remain as they are.
      5. Make sure this latest state of the workloads and DRPCs is *NOT* backed up, as mentioned in Step 3 above.

      Collect must-gather along with drpc state.

      Now perform a site failure (bring one of the managed clusters down along with the active hub cluster, but ensure that there are multiple workloads on both managed clusters and in the same state), then perform hub recovery.

      In my case, cluster C1 (amagrawa-12jul-c1) went down during site-failure.

      6. After moving to the new hub, ensure the DRPolicy is validated and the DRPCs are restored.
      7. Check the DRPC status (it should match the last backed-up DRPC state from Step 3 above); see the command sketch after this list.
      8. Check the deployment and pvc status of various workloads on the surviving managed cluster.
      9. After a few hours, recover the down managed cluster C1 and ensure it's successfully imported on the RHACM console of the new hub.
      10. Repeat steps 7 and 8.
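
      A minimal sketch of the commands behind steps 3 and 7 and the must-gather collection; the BackupSchedule name/namespace and the must-gather image tag are assumptions and should be adjusted to the actual environment:

      # Step 3 (sketch): slow down hub backups so that no new backup captures the
      # state change. The name "schedule-acm" and the namespace are assumptions
      # based on a typical ACM backup configuration.
      oc patch backupschedule schedule-acm -n open-cluster-management-backup \
        --type merge -p '{"spec":{"veleroSchedule":"0 */12 * * *"}}'

      # Step 7 (sketch): capture the DRPC state on the hub; -o wide adds the
      # PROGRESSION, START TIME, DURATION and PEER READY columns.
      oc get drpc -A -o wide

      # Must-gather collection (image tag is an assumption; use the one matching
      # the installed ODF release).
      oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.16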

      Actual results:

      ================================================================================================================================================================
      DRPC state when backup was taken (output of oc get drpc -A -o wide; columns: NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY):

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 6d1h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Relocated Completed 2024-07-29T09:41:28Z 32m29.415869422s True

      openshift-gitops cephfs-appset-busybox11-placement-drpc 6d1h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Relocated Completed 2024-07-29T09:41:11Z 5m44.774301626s True

      ================================================================================================================================================================
      DRPC state after backup was stopped:

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 6d15h amagrawa-12jul-c2 amagrawa-12jul-c1 Failover FailedOver Completed 2024-07-31T08:01:33Z 4m19.917032579s True

      openshift-gitops cephfs-appset-busybox11-placement-drpc 6d15h amagrawa-12jul-c2 amagrawa-12jul-c1 Failover FailedOver Completed 2024-07-31T08:00:55Z 4m27.721755836s True

      ================================================================================================================================================================
      DRPC state after hub recovery:

      PEER READY became False and CURRENTSTATE reports Initiating, but Relocate cannot be performed as one of the managed clusters (C1) is down

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 9h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:25Z False

      openshift-gitops cephfs-appset-busybox11-placement-drpc 9h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:22Z False

      ================================================================================================================================================================
      DRPC state after the down managed cluster is recovered and successfully imported on the RHACM console of the new hub:

      PEER READY is still False, CURRENTSTATE still reports Initiating, and the DRPCs remain stuck in this state forever

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 11d amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:25Z False

      openshift-gitops cephfs-appset-busybox11-placement-drpc 11d amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:22Z False

      ================================================================================================================================================================
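
      A minimal sketch of how the stuck state can be inspected further, assuming the Ramen DRPC/VRG status fields; resource names are taken from the listings above:

      # On the new hub: check the progression and the PeerReady condition of one
      # of the stuck DRPCs (field paths assume the Ramen DRPC CRD).
      oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
        -o jsonpath='{.status.progression}{"\n"}'
      oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
        -o jsonpath='{.status.conditions[?(@.type=="PeerReady")]}{"\n"}'

      # On the surviving managed cluster (amagrawa-12jul-c2): compare with the VRG
      # state, which is what the restored DRPC state disagrees with.
      oc get volumereplicationgroups -A -o wide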

      Logs collected before the backup was stopped (when all operations had successfully completed)-
      http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-before-backup-stopped/

      Logs collected before performing hub recovery- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-before-hub-recovery/

      Logs collected after performing hub recovery- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-after-hub-recovery/

      Expected results: After the down managed cluster is recovered and successfully imported on the RHACM console of the new hub, CURRENTSTATE for these workloads should report WaitForUser with PEER READY as True, and the admin should be able to relocate/failover them to the C2 managed cluster (see the sketch below).
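
      For context, a hedged sketch of how the admin action would be re-issued once PEER READY returns to True; the spec.action and spec.failoverCluster field names assume the Ramen DRPC CRD, and the cluster name is taken from this setup:

      # Sketch only: once CURRENTSTATE reports WaitForUser and PEER READY is True,
      # re-issue the action by patching the DRPC spec.
      oc patch drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
        --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-12jul-c2"}}'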

      Additional info:

              Assignee: Benamar Mekhissi (bmekhiss)
              Reporter: Aman Agrawal (amagrawa@redhat.com)