Data Foundation Bugs / DFBUGS-530

[2304182] [RDR] [Hub recovery] [Co-situated] Unable to resolve DRPC State when the backed-up state differs from the VRG state


      Description of problem (please be as detailed as possible and provide log snippets):

      This BZ is an extension of BZ2302144 (it is one of the issues observed while executing/filing BZ2302144) but is tracked separately here.

      Version of all relevant components (if applicable):

      Platform: VMware

      OCP 4.16.0-0.nightly-2024-07-29-013917
      ACM 2.11.1 GA'ed
      MCE 2.6.1
      OADP 1.4.0
      ODF 4.16 GA'ed
      Gitops 1.13.1
      ceph version 18.2.1-194.el9cp (04a992766839cd3207877e518a1238cdbac3787e) reef (stable)
      Submariner 0.18.0
      VolSync 0.9.2
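
      For completeness, a hedged sketch of how the component versions above can be collected; the namespace and toolbox deployment names are assumptions based on a default ODF/ACM install:

      # Cluster and operator versions (operator CSVs are listed per namespace)
      oc version
      oc get csv -n openshift-storage            # ODF operators
      oc get csv -n open-cluster-management      # ACM
      # Submariner CLI version, if subctl is installed on the workstation
      subctl version
      # Ceph version via the rook-ceph toolbox (only present if the toolbox is enabled)
      oc rsh -n openshift-storage deploy/rook-ceph-tools ceph version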

      Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an RDR setup with multiple workloads (rbd appset(pull)/sub, cephfs appset(pull)/sub, and imperative apps) in Deployed, FailedOver and Relocated states running on both managed clusters, configure it for hub recovery but do not start taking new backups.
      2. Before backups are taken, ensure the state above is achieved.
      3. Start taking backups, and once there are 1 or 2 successful backups, either stop the backup or increase the backup interval so that the actions in the following steps are performed without a new backup being taken (see the command sketch after this list).
      Collect must-gather and all other observations.
      4. Now move the workloads across the managed clusters and achieve the same state as in Step 1.
      Meaning, move the workloads which are primary on C1 to C2, and vice versa. Let the workloads in Deployed state remain as they are.
      5. Make sure this latest state of the workloads and DRPCs is *NOT* backed up, as mentioned in Step 3 above.

      Collect must-gather along with drpc state.

      Now perform a site failure (bring one of the managed clusters down along with the active hub cluster, but ensure that there are multiple workloads on both managed clusters and in the same state), then perform hub recovery.

      In my case, cluster C1 (amagrawa-12jul-c1) went down during site-failure.

      6. After moving to the new hub, ensure the DRPolicy is validated and the DRPCs are restored.
      7. Check the DRPC status (it should match the last backed-up DRPC state from Step 3 above); see the command sketch after this list.
      8. Check the deployment and pvc status of various workloads on the surviving managed cluster.
      9. After a few hours, recover the down managed cluster C1 and ensure it's successfully imported on the RHACM console of the new hub.
      10. Repeat steps 7 and 8.
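
      A minimal sketch of the commands behind steps 3 and 7 and the must-gather collection; the BackupSchedule name/namespace and the must-gather image tag are assumptions and should be adjusted to the actual environment:

      # Step 3 (sketch): slow down hub backups so that no new backup captures the
      # state change. The name "schedule-acm" and the namespace are assumptions
      # based on a typical ACM backup configuration.
      oc patch backupschedule schedule-acm -n open-cluster-management-backup \
        --type merge -p '{"spec":{"veleroSchedule":"0 */12 * * *"}}'

      # Step 7 (sketch): capture the DRPC state on the hub; -o wide adds the
      # PROGRESSION, START TIME, DURATION and PEER READY columns.
      oc get drpc -A -o wide

      # Must-gather collection (image tag is an assumption; use the one matching
      # the installed ODF release).
      oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.16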

      Actual results:

      ================================================================================================================================================================
      DRPC state when backup was taken (output of oc get drpc -A -o wide; columns: NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY):

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 6d1h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Relocated Completed 2024-07-29T09:41:28Z 32m29.415869422s True

      openshift-gitops cephfs-appset-busybox11-placement-drpc 6d1h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Relocated Completed 2024-07-29T09:41:11Z 5m44.774301626s True

      ================================================================================================================================================================
      DRPC state after backup was stopped:

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 6d15h amagrawa-12jul-c2 amagrawa-12jul-c1 Failover FailedOver Completed 2024-07-31T08:01:33Z 4m19.917032579s True

      openshift-gitops cephfs-appset-busybox11-placement-drpc 6d15h amagrawa-12jul-c2 amagrawa-12jul-c1 Failover FailedOver Completed 2024-07-31T08:00:55Z 4m27.721755836s True

      ================================================================================================================================================================
      DRPC state after hub recovery:

      PEER READY became False and CURRENTSTATE reports Initiating, but Relocate cannot be performed as one of the managed clusters (C1) is down

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 9h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:25Z False

      openshift-gitops cephfs-appset-busybox11-placement-drpc 9h amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:22Z False

      ================================================================================================================================================================
      DRPC state after the down managed cluster is recovered and successfully imported on the RHACM console of the new hub:

      PEER READY is still False, CURRENTSTATE still reports Initiating, and the DRPCs remain stuck in this state forever

      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 11d amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:25Z False

      openshift-gitops cephfs-appset-busybox11-placement-drpc 11d amagrawa-12jul-c2 amagrawa-12jul-c1 Relocate Initiating 2024-07-31T10:33:22Z False

      ================================================================================================================================================================
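
      A minimal sketch of how the stuck state can be inspected further, assuming the Ramen DRPC/VRG status fields; resource names are taken from the listings above:

      # On the new hub: check the progression and the PeerReady condition of one
      # of the stuck DRPCs (field paths assume the Ramen DRPC CRD).
      oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
        -o jsonpath='{.status.progression}{"\n"}'
      oc get drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
        -o jsonpath='{.status.conditions[?(@.type=="PeerReady")]}{"\n"}'

      # On the surviving managed cluster (amagrawa-12jul-c2): compare with the VRG
      # state, which is what the restored DRPC state disagrees with.
      oc get volumereplicationgroups -A -o wide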

      Logs collected before the backup was stopped (when all operations had successfully completed)-
      http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-before-backup-stopped/

      Logs collected before performing hub recovery- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-before-hub-recovery/

      Logs collected after performing hub recovery- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/31july24-after-hub-recovery/

      Expected results: After the down managed cluster is recovered and successfully imported on the RHACM console of the new hub, CURRENTSTATE for these workloads should report WaitForUser with PEER READY as True, and the admin should be able to relocate/failover them to the C2 managed cluster (see the sketch below).
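
      For context, a hedged sketch of how the admin action would be re-issued once PEER READY returns to True; the spec.action and spec.failoverCluster field names assume the Ramen DRPC CRD, and the cluster name is taken from this setup:

      # Sketch only: once CURRENTSTATE reports WaitForUser and PEER READY is True,
      # re-issue the action by patching the DRPC spec.
      oc patch drpc cephfs-sub-busybox13-placement-1-drpc -n busybox-workloads-13 \
        --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-12jul-c2"}}'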

      Additional info:

              Assignee: Benamar Mekhissi (bmekhiss)
              Reporter: Aman Agrawal (amagrawa@redhat.com)