Data Foundation Bugs / DFBUGS-632

[2283038] [RDR] [Hub recovery] [Co-situated] Failover never completes when peer ready is true but replication destination is missing

    • Bug
    • Resolution: Unresolved
    • Critical
    • odf-4.18
    • odf-4.16
    • odf-dr/ramen
      .Failover process fails when the `ReplicationDestination` resource has not been created yet

      If the user initiates a failover before the `LastGroupSyncTime` is updated, the failover process might fail. This failure is accompanied by an error message indicating that the `ReplicationDestination` does not exist.

      Workaround: To mitigate this issue, follow these steps:

      . Edit the `ManifestWork` for the VRG on the hub cluster.
      . Delete the following section from the manifest:

         ```
         /spec/workload/manifests/0/spec/volsync
         ```
      . Save the changes.

      Applying this workaround correctly ensures that the VRG skips attempting to restore the PVC using the `ReplicationDestination` resource. If the PVC already exists, the application uses it as is. If the PVC does not exist, a new PVC is created.
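
      A non-interactive sketch of the same edit, assuming the VRG `ManifestWork` follows the usual `<drpc-name>-<app-namespace>-vrg-mw` naming and lives in the failover cluster's namespace on the hub (both are assumptions; verify the actual name with the first command):

      ```
      # Hub cluster: find the VRG ManifestWork for the affected workload
      # (namespace = managed cluster name; the name pattern is an assumption).
      oc get manifestwork -n <failover-cluster-name> | grep vrg-mw

      # Remove the volsync section so the VRG skips the ReplicationDestination-based restore.
      oc patch manifestwork <drpc-name>-<app-namespace>-vrg-mw -n <failover-cluster-name> \
        --type=json -p '[{"op": "remove", "path": "/spec/workload/manifests/0/spec/volsync"}]'
      ```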
    • Known Issue
    • RamenDR sprint 2024 #16

      Description of problem (please be as detailed as possible and provide log snippets):

      Version of all relevant components (if applicable):
      ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
      OCP 4.16.0-0.nightly-2024-04-26-145258
      ODF 4.16.0-89.stable
      ACM 2.10.2
      MCE 2.5.2
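
      A hedged sketch of how these versions can be collected, assuming the default openshift-storage, open-cluster-management, and multicluster-engine namespaces and that the Rook toolbox deployment is enabled (adjust for your environment):

      ```
      # OCP client and cluster version
      oc version

      # ODF operator versions (CSVs in the storage namespace)
      oc get csv -n openshift-storage

      # ACM and MCE operator versions on the hub
      oc get csv -n open-cluster-management
      oc get csv -n multicluster-engine

      # Ceph version, via the Rook toolbox (if the toolbox deployment is enabled)
      oc -n openshift-storage rsh deploy/rook-ceph-tools ceph version
      ```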

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      ****Active hub co-situated with primary managed cluster****

      1. On an RDR setup with both RBD and CephFS workloads of Subscription and ApplicationSet (pull model) types in distinct states such as Deployed, FailedOver, and Relocated, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
      2. Fail over all the workloads that were running on the down managed cluster to the surviving managed cluster.
      3. After successful failover, recover the down managed cluster.

      During cleanup, both VRG states are marked as Secondary for the CephFS workloads on the recovered managed cluster, which eventually marks PeerReady as True in the DRPC resource on the hub. However, the ReplicationDestination is not created on the recovered cluster until the eviction period times out, which is currently 24 hours.

      4. Now fail over the CephFS workloads back to the recovered cluster, where PeerReady is marked True but the ReplicationDestination has not been created.

      Actual results: Because PeerReady is marked True for the CephFS workloads in this case, the UI allows the failover even though the first sync has not completed due to the missing ReplicationDestination.

      Marking PeerReady True is expected when both VRG states are marked Secondary on the recovered cluster (refer to comment https://bugzilla.redhat.com/show_bug.cgi?id=2263488#c21); however, the failover never completes when the ReplicationDestination is missing.

      The idea is to allow the failover back to the recovered cluster to proceed using the last restored PVC state.
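
      A pre-failover check illustrating the condition, using the resource names from the outputs below:

      ```
      # On the failover target cluster: one ReplicationDestination per protected PVC is
      # expected; an empty result is the condition that triggers this bug.
      oc get replicationdestinations.volsync.backube -n busybox-workloads-15

      # On the hub, before initiating the failover: PeerReady can already report True
      # even though the target has no ReplicationDestination, so the UI allows failover.
      oc get drpc cephfs-sub-busybox15-placement-1-drpc -n busybox-workloads-15 \
        -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}{"\n"}'
      ```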

      New Hub-

      busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 7d9h amagrawa-c2-29a amagrawa-c1-29a Failover FailedOver WaitForReadiness 2024-05-17T07:41:59Z False

      oc get drpc -o yaml -n busybox-workloads-15
      apiVersion: v1
      items:
      - apiVersion: ramendr.openshift.io/v1alpha1
        kind: DRPlacementControl
        metadata:
          annotations:
            drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
            drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-c2-29a
          creationTimestamp: "2024-05-16T07:51:31Z"
          finalizers:
          - drpc.ramendr.openshift.io/finalizer
          generation: 3
          labels:
            cluster.open-cluster-management.io/backup: ramen
            velero.io/backup-name: acm-resources-generic-schedule-20240516070015
            velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20240516070015
          name: cephfs-sub-busybox15-placement-1-drpc
          namespace: busybox-workloads-15
          ownerReferences:
          - apiVersion: cluster.open-cluster-management.io/v1beta1
            blockOwnerDeletion: true
            controller: true
            kind: Placement
            name: cephfs-sub-busybox15-placement-1
            uid: 31b90e55-e8e3-42b4-8f0a-ca8a71daa7ab
          resourceVersion: "36276430"
          uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
        spec:
          action: Failover
          drPolicyRef:
            apiVersion: ramendr.openshift.io/v1alpha1
            kind: DRPolicy
            name: my-drpolicy-5
          failoverCluster: amagrawa-c1-29a
          placementRef:
            apiVersion: cluster.open-cluster-management.io/v1beta1
            kind: Placement
            name: cephfs-sub-busybox15-placement-1
            namespace: busybox-workloads-15
          preferredCluster: amagrawa-c2-29a
          pvcSelector:
            matchLabels:
              appname: busybox_app3_cephfs
        status:
          actionStartTime: "2024-05-17T07:41:59Z"
          conditions:
          - lastTransitionTime: "2024-05-17T07:42:28Z"
            message: Completed
            observedGeneration: 3
            reason: FailedOver
            status: "True"
            type: Available
          - lastTransitionTime: "2024-05-17T07:41:59Z"
            message: Started failover to cluster "amagrawa-c1-29a"
            observedGeneration: 3
            reason: NotStarted
            status: "False"
            type: PeerReady
          lastUpdateTime: "2024-05-23T16:40:50Z"
          phase: FailedOver
          preferredDecision:
            clusterName: amagrawa-c1-29a
            clusterNamespace: amagrawa-c1-29a
          progression: WaitForReadiness
          resourceConditions:
            conditions:
            - lastTransitionTime: "2024-05-16T08:00:28Z"
              message: All VolSync PVCs are ready
              observedGeneration: 6
              reason: Ready
              status: "True"
              type: DataReady
            - lastTransitionTime: "2024-05-16T08:00:28Z"
              message: Not all VolSync PVCs are protected
              observedGeneration: 6
              reason: DataProtected
              status: "False"
              type: DataProtected
            - lastTransitionTime: "2024-05-16T08:00:16Z"
              message: Nothing to restore
              observedGeneration: 6
              reason: Restored
              status: "True"
              type: ClusterDataReady
            - lastTransitionTime: "2024-05-16T08:00:28Z"
              message: Not all VolSync PVCs are protected
              observedGeneration: 6
              reason: DataProtected
              status: "False"
              type: ClusterDataProtected
            resourceMeta:
              generation: 6
              kind: VolumeReplicationGroup
              name: cephfs-sub-busybox15-placement-1-drpc
              namespace: busybox-workloads-15
              protectedpvcs:
              - busybox-pvc-1
      kind: List
      metadata:
        resourceVersion: ""
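
      The stuck failover can be confirmed from the hub; a small sketch using the same DRPC (the progression stays at WaitForReadiness):

      ```
      # Hub cluster: report the phase and progression of the DRPC shown above.
      oc get drpc cephfs-sub-busybox15-placement-1-drpc -n busybox-workloads-15 \
        -o jsonpath='{.status.phase}{" / "}{.status.progression}{"\n"}'
      ```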

      Recovered cluster C1-

      oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
      Now using project "busybox-workloads-15" on server "https://api.amagrawa-c2-29a.qe.rh-ocs.com:6443".
      NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
      persistentvolumeclaim/busybox-pvc-1 Bound pvc-a333d21e-3ab7-425d-8254-8fa62522dc3f 94Gi RWX ocs-storagecluster-cephfs <unset> 23d Filesystem
      persistentvolumeclaim/volsync-busybox-pvc-1-src Bound pvc-06823313-ed2d-49df-9773-55ef9a56f114 94Gi ROX ocs-storagecluster-cephfs-vrg <unset> 7d9h Filesystem

      NAME DESIREDSTATE CURRENTSTATE
      volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc primary Primary

      NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
      pod/busybox-1-7f9b67dc95-6hjn2 1/1 Running 0 7d9h 10.128.3.234 compute-2 <none> <none>
      pod/volsync-rsync-tls-src-busybox-pvc-1-676pm 0/1 Error 0 26m 10.128.2.70 compute-2 <none> <none>
      pod/volsync-rsync-tls-src-busybox-pvc-1-zxzl5 1/1 Running 0 4m7s 10.128.2.71 compute-2 <none> <none>

      oc describe vrg
      Name:         cephfs-sub-busybox15-placement-1-drpc
      Namespace:    busybox-workloads-15
      Labels:       <none>
      Annotations:  drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c2-29a
                    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc:
                    drplacementcontrol.ramendr.openshift.io/drpc-uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
      API Version:  ramendr.openshift.io/v1alpha1
      Kind:         VolumeReplicationGroup
      Metadata:
        Creation Timestamp:  2024-04-30T13:36:03Z
        Finalizers:
          volumereplicationgroups.ramendr.openshift.io/vrg-protection
        Generation:  6
        Owner References:
          API Version:  work.open-cluster-management.io/v1
          Kind:         AppliedManifestWork
          Name:         661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-sub-busybox15-placement-1-drpc-busybox-workloads-15-vrg-mw
          UID:          79905b6c-78f9-414c-abc9-a6506a5cf852
        Resource Version:  47644387
        UID:               61c8fe31-6d15-4b42-876e-2f5d9f8d55af
      Spec:
        Action:  Failover
        Async:
          Replication Class Selector:
          Scheduling Interval:  5m
          Volume Snapshot Class Selector:
        Pvc Selector:
          Match Labels:
            Appname:  busybox_app3_cephfs
        Replication State:  primary
        s3Profiles:
          s3profile-amagrawa-c1-29a-ocs-storagecluster
          s3profile-amagrawa-c2-29a-ocs-storagecluster
        Vol Sync:
      Status:
        Conditions:
          Last Transition Time:  2024-05-16T08:00:28Z
          Message:               All VolSync PVCs are ready
          Observed Generation:   6
          Reason:                Ready
          Status:                True
          Type:                  DataReady
          Last Transition Time:  2024-05-16T08:00:28Z
          Message:               Not all VolSync PVCs are protected
          Observed Generation:   6
          Reason:                DataProtected
          Status:                False
          Type:                  DataProtected
          Last Transition Time:  2024-05-16T08:00:16Z
          Message:               Nothing to restore
          Observed Generation:   6
          Reason:                Restored
          Status:                True
          Type:                  ClusterDataReady
          Last Transition Time:  2024-05-16T08:00:28Z
          Message:               Not all VolSync PVCs are protected
          Observed Generation:   6
          Reason:                DataProtected
          Status:                False
          Type:                  ClusterDataProtected
        Kube Object Protection:
        Last Update Time:     2024-05-23T16:40:25Z
        Observed Generation:  6
        Protected PV Cs:
          Access Modes:
            ReadWriteMany
          Annotations:
            apps.open-cluster-management.io/hosting-subscription:  busybox-workloads-15/cephfs-sub-busybox15-subscription-1
            apps.open-cluster-management.io/reconcile-option:      merge
          Conditions:
            Last Transition Time:  2024-05-16T08:00:16Z
            Message:               Ready
            Observed Generation:   6
            Reason:                SourceInitialized
            Status:                True
            Type:                  ReplicationSourceSetup
            Last Transition Time:  2024-05-16T07:59:24Z
            Message:               PVC restored
            Observed Generation:   5
            Reason:                Restored
            Status:                True
            Type:                  PVsRestored
          Labels:
            App:                                             cephfs-sub-busybox15
            app.kubernetes.io/part-of:                       cephfs-sub-busybox15
            Appname:                                         busybox_app3_cephfs
            apps.open-cluster-management.io/reconcile-rate:  medium
            velero.io/backup-name:                           acm-resources-schedule-20240516070016
            velero.io/restore-name:                          restore-acm-acm-resources-schedule-20240516070016
          Name:                   busybox-pvc-1
          Namespace:              busybox-workloads-15
          Protected By Vol Sync:  true
          Replication ID:
            Id:
          Resources:
            Requests:
              Storage:  94Gi
          Storage Class Name:  ocs-storagecluster-cephfs
          Storage ID:
            Id:
        State:  Primary
      Events:
        Type    Reason                    Age                   From                               Message
        ----    ------                    ----                  ----                               -------
        Normal  PrimaryVRGProcessSuccess  62m (x42 over 3h21m)  controller_VolumeReplicationGroup  Primary Success
        Normal  PrimaryVRGProcessSuccess  20m (x5 over 62m)     controller_VolumeReplicationGroup  Primary Success

      C1 still has a ReplicationSource for the failed-over workload, but no ReplicationDestination:

      oc get replicationsources.volsync.backube -A
      NAMESPACE NAME SOURCE LAST SYNC DURATION NEXT SYNC
      busybox-workloads-15 busybox-pvc-1 busybox-pvc-1
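
      To confirm the missing piece explicitly (rather than only listing ReplicationSources), a hedged query for ReplicationDestinations on C1:

      ```
      # Recovered cluster C1: no ReplicationDestination exists in the application
      # namespace, so the VRG has nothing to restore the PVC from during failover.
      oc get replicationdestinations.volsync.backube -n busybox-workloads-15
      ```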

      Surviving cluster C2-

      oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
      Already on project "busybox-workloads-15" on server "https://api.amagrawa-c2-29a.qe.rh-ocs.com:6443".
      NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
      persistentvolumeclaim/busybox-pvc-1 Bound pvc-a333d21e-3ab7-425d-8254-8fa62522dc3f 94Gi RWX ocs-storagecluster-cephfs <unset> 23d Filesystem
      persistentvolumeclaim/volsync-busybox-pvc-1-src Bound pvc-06823313-ed2d-49df-9773-55ef9a56f114 94Gi ROX ocs-storagecluster-cephfs-vrg <unset> 7d9h Filesystem

      NAME DESIREDSTATE CURRENTSTATE
      volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc primary Primary

      NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
      pod/busybox-1-7f9b67dc95-6hjn2 1/1 Running 0 7d9h 10.128.3.234 compute-2 <none> <none>
      pod/volsync-rsync-tls-src-busybox-pvc-1-676pm 0/1 Error 0 28m 10.128.2.70 compute-2 <none> <none>
      pod/volsync-rsync-tls-src-busybox-pvc-1-zxzl5 1/1 Running 0 5m19s 10.128.2.71 compute-2 <none> <none>

      oc describe vrg
      Name:         cephfs-sub-busybox15-placement-1-drpc
      Namespace:    busybox-workloads-15
      Labels:       <none>
      Annotations:  drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c2-29a
                    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc:
                    drplacementcontrol.ramendr.openshift.io/drpc-uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
      API Version:  ramendr.openshift.io/v1alpha1
      Kind:         VolumeReplicationGroup
      Metadata:
        Creation Timestamp:  2024-04-30T13:36:03Z
        Finalizers:
          volumereplicationgroups.ramendr.openshift.io/vrg-protection
        Generation:  6
        Owner References:
          API Version:  work.open-cluster-management.io/v1
          Kind:         AppliedManifestWork
          Name:         661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-sub-busybox15-placement-1-drpc-busybox-workloads-15-vrg-mw
          UID:          79905b6c-78f9-414c-abc9-a6506a5cf852
        Resource Version:  47644387
        UID:               61c8fe31-6d15-4b42-876e-2f5d9f8d55af
      Spec:
        Action:  Failover
        Async:
          Replication Class Selector:
          Scheduling Interval:  5m
          Volume Snapshot Class Selector:
        Pvc Selector:
          Match Labels:
            Appname:  busybox_app3_cephfs
        Replication State:  primary
        s3Profiles:
          s3profile-amagrawa-c1-29a-ocs-storagecluster
          s3profile-amagrawa-c2-29a-ocs-storagecluster
        Vol Sync:
      Status:
        Conditions:
          Last Transition Time:  2024-05-16T08:00:28Z
          Message:               All VolSync PVCs are ready
          Observed Generation:   6
          Reason:                Ready
          Status:                True
          Type:                  DataReady
          Last Transition Time:  2024-05-16T08:00:28Z
          Message:               Not all VolSync PVCs are protected
          Observed Generation:   6
          Reason:                DataProtected
          Status:                False
          Type:                  DataProtected
          Last Transition Time:  2024-05-16T08:00:16Z
          Message:               Nothing to restore
          Observed Generation:   6
          Reason:                Restored
          Status:                True
          Type:                  ClusterDataReady
          Last Transition Time:  2024-05-16T08:00:28Z
          Message:               Not all VolSync PVCs are protected
          Observed Generation:   6
          Reason:                DataProtected
          Status:                False
          Type:                  ClusterDataProtected
        Kube Object Protection:
        Last Update Time:     2024-05-23T16:40:25Z
        Observed Generation:  6
        Protected PV Cs:
          Access Modes:
            ReadWriteMany
          Annotations:
            apps.open-cluster-management.io/hosting-subscription:  busybox-workloads-15/cephfs-sub-busybox15-subscription-1
            apps.open-cluster-management.io/reconcile-option:      merge
          Conditions:
            Last Transition Time:  2024-05-16T08:00:16Z
            Message:               Ready
            Observed Generation:   6
            Reason:                SourceInitialized
            Status:                True
            Type:                  ReplicationSourceSetup
            Last Transition Time:  2024-05-16T07:59:24Z
            Message:               PVC restored
            Observed Generation:   5
            Reason:                Restored
            Status:                True
            Type:                  PVsRestored
          Labels:
            App:                                             cephfs-sub-busybox15
            app.kubernetes.io/part-of:                       cephfs-sub-busybox15
            Appname:                                         busybox_app3_cephfs
            apps.open-cluster-management.io/reconcile-rate:  medium
            velero.io/backup-name:                           acm-resources-schedule-20240516070016
            velero.io/restore-name:                          restore-acm-acm-resources-schedule-20240516070016
          Name:                   busybox-pvc-1
          Namespace:              busybox-workloads-15
          Protected By Vol Sync:  true
          Replication ID:
            Id:
          Resources:
            Requests:
              Storage:  94Gi
          Storage Class Name:  ocs-storagecluster-cephfs
          Storage ID:
            Id:
        State:  Primary
      Events:
        Type    Reason                    Age                   From                               Message
        ----    ------                    ----                  ----                               -------
        Normal  PrimaryVRGProcessSuccess  63m (x42 over 3h22m)  controller_VolumeReplicationGroup  Primary Success
        Normal  PrimaryVRGProcessSuccess  20m (x5 over 62m)     controller_VolumeReplicationGroup  Primary Success

      C2 has a ReplicationSource too, but that is expected (failover is successful and the workload is running on this cluster):

      oc get replicationsources.volsync.backube -A
      NAMESPACE NAME SOURCE LAST SYNC DURATION NEXT SYNC
      busybox-workloads-15 busybox-pvc-1 busybox-pvc-1

      Expected results: Failover should complete using the last restored PVC state when the ReplicationDestination is missing.

      Additional info:

              Assignee: Benamar Mekhissi (bmekhiss)
              Reporter: Aman Agrawal (amagrawa@redhat.com)
              Benamar Mekhissi, Erin Donnelly
              Krishnaram Karthick Ramdoss