Data Foundation Bugs / DFBUGS-612

[2252082] [RDR] [Hub recovery] drpc progression flickers b/w Cleaning Up and WaitForReadiness on failover operation


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.14
    • Component: odf-dr/ramen
    • Sprints: RamenDR sprint 2024 #16, RamenDR sprint 2024 #20

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):
      OCP 4.14.0-0.nightly-2023-11-27-160916
      ACM v2.9.0-RC3
      ODF 4.14.1-13
      ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
      Submariner 0.16.2
      VolSync 0.8.0

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Deploy multiple RBD- and CephFS-backed workloads of both appset and subscription types.
      2. Fail over and relocate them so that they all end up running on the primary managed cluster (which is expected to host all the workloads and may go down in a disaster).
      3. Ensure the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc.
      4. Let at least 1 or 2 of the latest backups be taken (one every hour) for all the different workload states (while Progression is Completed and no action is running on any of the workloads). Also ensure that sync is working fine for all the workloads while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, download backups from S3, etc. (see the verification sketch after this list).
      5. Bring the active hub completely down and move to the passive hub. Restore the backups and ensure the Velero backup reports a successful restoration. Make sure both managed clusters are successfully reported and the DRPolicy gets validated.
      6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state.
      7. Let IOs continue for a few hours. Now bring the primary managed cluster down, wait 10-15 minutes, and then fail over the workloads to the secondary managed cluster while collecting drpc -o wide output in a loop (see the collection loop after this list; in this case the CephFS workloads were failed over). Leave the primary managed cluster down and wait for the drpc Progression to reach the Cleaning Up phase.
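
      For steps 4 and 5, a minimal verification sketch. Assumptions not stated in this report: lastGroupSyncTime is exposed in the DRPC status, and the ACM/Velero restore objects live in the open-cluster-management-backup namespace; adjust names to the setup.

      # On the active hub (step 4): note sync state per workload before the hub goes down.
      oc get drpc -A -o wide
      oc get drpc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTSYNC:.status.lastGroupSyncTime'

      # On the passive hub (step 5): confirm the restore, the managed clusters and the DRPolicy.
      oc get restores.velero.io -n open-cluster-management-backup
      oc get managedclusters
      oc get drpolicy -o yaml | grep -A3 'type: Validated'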
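
      For step 7, a sketch of the failover trigger and the collection loop that produced the output below. The failover is normally driven from the ACM console; the patch only illustrates the equivalent DRPC change, and the workload/cluster names are the ones from this setup.

      # Illustrative only: fail over one appset workload by setting the DRPC action.
      oc patch drpc appset-cephfs-busybox7-placement-drpc -n openshift-gitops \
        --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-28n-c2"}}'

      # Collect drpc -o wide output in a loop with UTC timestamps and separators.
      while true; do
        oc get drpc -A -o wide
        echo
        date -u
        echo "=========================================================================================================================="
        sleep 30
      done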

      Actual results: drpc progression flickers between Cleaning Up and WaitForReadiness on the failover operation

      appset-cephfs-busybox7 was failed over in this case

      drpc -o wide ==>>

      openshift-gitops appset-cephfs-busybox7-placement-drpc 11h amagrawa-28n-c1 amagrawa-28n-c2 Failover FailingOver WaitingForResourceRestore 2023-11-29T08:20:34Z False

      Wednesday 29 November 2023 08:26:12 AM UTC
      ==========================================================================================================================

      openshift-gitops appset-cephfs-busybox7-placement-drpc 11h amagrawa-28n-c1 amagrawa-28n-c2 Failover FailedOver Cleaning Up 2023-11-29T08:20:34Z False

      Wednesday 29 November 2023 08:26:16 AM UTC
      ==========================================================================================================================

      openshift-gitops appset-cephfs-busybox7-placement-drpc 11h amagrawa-28n-c1 amagrawa-28n-c2 Failover FailedOver Cleaning Up 2023-11-29T08:20:34Z False

      Wednesday 29 November 2023 08:26:44 AM UTC
      ==========================================================================================================================

      then changes back to WaitForReadiness

      ==========================================================================================================================
      openshift-gitops appset-cephfs-busybox7-placement-drpc 11h amagrawa-28n-c1 amagrawa-28n-c2 Failover FailedOver WaitForReadiness 2023-11-29T08:20:34Z False

      Wednesday 29 November 2023 08:26:48 AM UTC
      ==========================================================================================================================

      and then changes back to Cleaning Up

      ==========================================================================================================================

      openshift-gitops appset-cephfs-busybox7-placement-drpc 11h amagrawa-28n-c1 amagrawa-28n-c2 Failover FailedOver Cleaning Up 2023-11-29T08:20:34Z False

      Wednesday 29 November 2023 08:32:18 AM UTC
      ==========================================================================================================================

      The test case was repeated on other CephFS workloads on the same setup, and the observation remains the same.

      Must gather logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/29nov23-2/

      Expected results: Once Progression starts reporting Cleaning Up, it shouldn't go back to an earlier state.
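
      A minimal check for this expectation (assuming Progression is exposed as .status.progression on the DRPC; the field name may differ by version):

      # Log every Progression transition for the affected DRPC; a regression shows up as
      # "Cleaning Up -> WaitForReadiness" followed later by "WaitForReadiness -> Cleaning Up".
      prev=""
      while true; do
        cur=$(oc get drpc appset-cephfs-busybox7-placement-drpc -n openshift-gitops \
              -o jsonpath='{.status.progression}')
        if [ "$cur" != "$prev" ]; then
          echo "$(date -u)  Progression: ${prev:-<none>} -> $cur"
        fi
        prev="$cur"
        sleep 5
      done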

      Additional info:

              Benamar Mekhissi (bmekhiss)
              Aman Agrawal (amagrawa@redhat.com)
              Krishnaram Karthick Ramdoss