Data Foundation Bugs / DFBUGS-134

[2312801] [RDR] CephFS Consistency Group: Sync stops after a few scheduling intervals

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.17
    • Component: csi-addons

      Description of problem (please be as detailed as possible and provide log
      snippets):
      In an OCP/ODF 4.17 RDR setup with the required OCP feature gate enabled, after enabling DR on CephFS-based workloads that use consistency groups, the sync process stops unexpectedly after a few scheduling intervals.

      Version of all relevant components (if applicable):
      OCP: 4.17.0-0.nightly-2024-09-09-120947
      ODF: 4.17.0-97
      ACM: 2.12.0-69 (2.12.0-DOWNSTREAM-2024-09-04-21-14-10)
      Submariner: 0.18.0 (Globalnet enabled)
      VolSync: 0.10.0

      Hub
      $ oc get pod -n openshift-operators ramen-hub-operator-79b6d95dbb-svfzx -o yaml | egrep "image:|imageID:"
      image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
      image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
      image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
      imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
      image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
      imageID: quay.io/bmekhiss/ramen-operator@sha256:d63eb9016368840112e08ddccebe130e554b3a801df6b5347a1ad5eb8fca90e1

      Managed cluster
      $ oc get pod -n openshift-dr-system -o yaml | egrep "image:|imageID:"
      image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
      image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
      image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
      imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
      image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
      imageID: quay.io/bmekhiss/ramen-operator@sha256:06cdf068c1e098f80fe6149e1d55b538b238c1facda8cda646a228cde48a98df

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?
      Yes, DR protection for workloads using consistency groups is not functioning.

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      3

      Is this issue reproducible?
      Yes

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Deploy RDR setup
      2. Enable required feature gate in OCP
      3. Deploy a sample CephFS-based workload that uses consistency groups
      The workload consists of four PVCs, grouped into two consistency groups: cg1 and cg2.

      NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
      busybox-cg1-pvc1 Bound pvc-d992671a-7178-4256-91c8-09accf4b76b3 1Gi RWX rook-cephfs1 <unset> 7s
      busybox-cg1-pvc2 Bound pvc-23f2f074-5627-4923-bfa6-05faea435d38 2Gi RWX rook-cephfs1 <unset> 7s
      busybox-cg2-pvc1 Bound pvc-9faea735-6516-41fe-b672-ca6410ff0e67 3Gi RWX rook-cephfs2 <unset> 7s
      busybox-cg2-pvc2 Bound pvc-31e69161-9858-4078-bcda-9bc2a3445b4d 4Gi RWX rook-cephfs2 <unset> 7s

      4. Enable DR protection via the UI using the current documented steps
      5. Scale down the Ramen hub operator (see the example commands after this list)
      6. Delete the VRG ManifestWorks on the hub cluster
      7. Delete the workload deployments on the primary cluster
      8. Edit the DRPC resource and add the annotation drplacementcontrol.ramendr.openshift.io/is-cg-enabled: "true"
      9. Wait for the workload to be redeployed on the primary cluster
      10. Scale up the Ramen hub operator
      11. Wait for a few scheduling intervals and observe that the sync starts lagging.
      12. DRPC lastGroupSyncTime doesn't get updated. The VGS resource is not deleted and appears to be stuck on the primary cluster.
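
      A rough sketch of the commands behind steps 5-12 follows. The hub operator deployment name and namespace are taken from the pod listing above; the DRPC name, its namespace, and the ManifestWork names are placeholders that depend on the workload, so adjust them to the environment.

      # Step 5: scale down the Ramen hub operator (on the hub cluster)
      $ oc scale deployment ramen-hub-operator -n openshift-operators --replicas=0
      # Step 6: find and delete the VRG ManifestWorks on the hub (names are environment specific)
      $ oc get manifestwork -A | grep -i vrg
      $ oc delete manifestwork <vrg-manifestwork-name> -n <managed-cluster-namespace>
      # Step 8: annotate the DRPC to enable consistency groups
      $ oc annotate drpc <drpc-name> -n <drpc-namespace> drplacementcontrol.ramendr.openshift.io/is-cg-enabled="true"
      # Step 10: scale the hub operator back up
      $ oc scale deployment ramen-hub-operator -n openshift-operators --replicas=1
      # Steps 11-12: watch lastGroupSyncTime on the DRPC to see whether sync keeps progressing
      $ oc get drpc <drpc-name> -n <drpc-namespace> -o jsonpath='{.status.lastGroupSyncTime}{"\n"}'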

      Actual results:
      Sync for workloads that use consistency groups stops after some time.

      Expected results:
      Sync for workloads that use consistency groups shouldn't stop.

      Additional info:

      Error message observed in pod logs:
      ERROR ReplicationGroupSourceMachine cephfscg/replicationgroupsource.go:108 Failed to create volume group snapshot {"controller": "replicationgroupsource", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupSource", "ReplicationGroupSource": {"name":"busybox-cg1-placement-1-drpccg2","namespace":"busybox-cg1"}, "namespace": "busybox-cg1", "name": "busybox-cg1-placement-1-drpccg2", "reconcileID": "fd6d1671-4780-4c93-833e-1785d7fa29b7", "error": "the volume group snapshot is being deleted, need to wait"}
      github.com/ramendr/ramen/internal/controller/cephfscg.(*replicationGroupSourceMachine).Synchronize
      /workspace/internal/controller/cephfscg/replicationgroupsource.go:108
      github.com/backube/volsync/controllers/statemachine.doSynchronizingState
      /go/pkg/mod/github.com/backube/volsync@v0.7.1/controllers/statemachine/machine.go:102
      github.com/backube/volsync/controllers/statemachine.Run
      /go/pkg/mod/github.com/backube/volsync@v0.7.1/controllers/statemachine/machine.go:70
      github.com/ramendr/ramen/internal/controller.(*ReplicationGroupSourceReconciler).Reconcile
      /workspace/internal/controller/replicationgroupsource_controller.go:116
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:114
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:311
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222
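
      The error above indicates that the previous VolumeGroupSnapshot on the primary cluster is stuck in deletion, which matches the VGS observation in step 12. A possible way to confirm this, assuming the VGS lives in the workload namespace from the log (busybox-cg1); the exact VGS name is environment specific:

      $ oc get volumegroupsnapshot -n busybox-cg1
      # A stuck VGS typically has a deletionTimestamp set while finalizers are still present:
      $ oc get volumegroupsnapshot <vgs-name> -n busybox-cg1 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'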

              Yati Padia (ypadia@redhat.com)
              Sidhant Agrawal (sagrawal@redhat.com)
              Krishnaram Karthick Ramdoss