- Bug
- Resolution: Unresolved
- Critical
- odf-4.17
- None
Description of problem (please be as detailed as possible and provide log
snippets):
In an OCP/ODF 4.17 RDR setup with the required OCP feature gate enabled, after enabling DR on CephFS-based workloads that use consistency groups,
the sync process stops unexpectedly after a few scheduling intervals.
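For reference, the stalled sync is visible from the hub by watching the DRPC status field referred to below as lastGroupSyncTime; the resource name and namespace here are placeholders for this setup:
$ oc get drpc <drpc-name> -n <drpc-namespace> -o jsonpath='{.status.lastGroupSyncTime}'
Once the issue hits, this timestamp stops advancing across scheduling intervals.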
Version of all relevant components (if applicable):
OCP: 4.17.0-0.nightly-2024-09-09-120947
ODF: 4.17.0-97
ACM: 2.12.0-69 (2.12.0-DOWNSTREAM-2024-09-04-21-14-10)
Submariner: 0.18.0 (Globalnet enabled)
VolSync: 0.10.0
Hub cluster
$ oc get pod -n openshift-operators ramen-hub-operator-79b6d95dbb-svfzx -o yaml | egrep "image:|imageID:"
image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
imageID: quay.io/bmekhiss/ramen-operator@sha256:d63eb9016368840112e08ddccebe130e554b3a801df6b5347a1ad5eb8fca90e1
Managed cluster
$ oc get pod -n openshift-dr-system -o yaml | egrep "image:|imageID:"
image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
image: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9@sha256:334c6f7811e882797be1a1ed74cbb878c65d6bb6621be1544cf12f693091f9fb
image: quay.io/bmekhiss/ramen-operator:upgrade.to.latest.crd
imageID: quay.io/bmekhiss/ramen-operator@sha256:06cdf068c1e098f80fe6149e1d55b538b238c1facda8cda646a228cde48a98df
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, DR protection for workloads using consistency groups is not functioning.
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
Yes
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy an RDR setup
2. Enable the required feature gate in OCP
3. Deploy a sample CephFS-based workload that uses consistency groups.
The workload consists of four PVCs, grouped into two consistency groups: cg1 and cg2.
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
busybox-cg1-pvc1 Bound pvc-d992671a-7178-4256-91c8-09accf4b76b3 1Gi RWX rook-cephfs1 <unset> 7s
busybox-cg1-pvc2 Bound pvc-23f2f074-5627-4923-bfa6-05faea435d38 2Gi RWX rook-cephfs1 <unset> 7s
busybox-cg2-pvc1 Bound pvc-9faea735-6516-41fe-b672-ca6410ff0e67 3Gi RWX rook-cephfs2 <unset> 7s
busybox-cg2-pvc2 Bound pvc-31e69161-9858-4078-bcda-9bc2a3445b4d 4Gi RWX rook-cephfs2 <unset> 7s
4. Enable DR protection via the UI using the currently documented steps
5. Scale down the Ramen hub operator
6. Delete the VRG ManifestWorks on the hub cluster
7. Delete the workload deployments on the primary cluster
8. Edit the DRPC resource and add the annotation drplacementcontrol.ramendr.openshift.io/is-cg-enabled: "true" (a command sketch for steps 5-10 follows this list)
9. Wait for the workload to be redeployed on the primary cluster
10. Scale up the Ramen hub operator
11. Wait a few scheduling intervals and observe that the sync starts lagging.
12. The DRPC lastGroupSyncTime does not get updated, and the VolumeGroupSnapshot (VGS) resource is not deleted and appears to be stuck on the primary cluster (verification commands follow this list).
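For reference, steps 5-10 correspond roughly to the following hub-side commands (step 7 is a workload delete on the primary cluster and is omitted). The deployment name is assumed from the hub pod name above; the DRPC, namespace, and ManifestWork names are placeholders:
# Step 5: scale down the Ramen hub operator (deployment name assumed from the pod listed above)
$ oc scale deployment ramen-hub-operator -n openshift-operators --replicas=0
# Step 6: delete the VRG ManifestWorks in the managed cluster namespaces on the hub (names are placeholders)
$ oc delete manifestwork <vrg-manifestwork-name> -n <managed-cluster-namespace>
# Step 8: annotate the DRPC to enable consistency-group handling
$ oc annotate drpc <drpc-name> -n <drpc-namespace> drplacementcontrol.ramendr.openshift.io/is-cg-enabled="true"
# Step 10: scale the hub operator back up
$ oc scale deployment ramen-hub-operator -n openshift-operators --replicas=1
Verification for step 12 (DRPC on the hub, VGS on the primary cluster):
$ oc get drpc <drpc-name> -n <drpc-namespace> -o jsonpath='{.status.lastGroupSyncTime}'
$ oc get volumegroupsnapshot -n busybox-cg1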
Actual results:
Sync for the workload that uses consistency groups stops after some time.
Expected results:
Sync for the workload that uses consistency groups should not stop.
Additional info:
Error message observed in pod logs:
ERROR ReplicationGroupSourceMachine cephfscg/replicationgroupsource.go:108 Failed to create volume group snapshot {"controller": "replicationgroupsource", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupSource", "ReplicationGroupSource": , "namespace": "busybox-cg1", "name": "busybox-cg1-placement-1-drpccg2", "reconcileID": "fd6d1671-4780-4c93-833e-1785d7fa29b7", "error": "the volume group snapshot is being deleted, need to wait"}
github.com/ramendr/ramen/internal/controller/cephfscg.(*replicationGroupSourceMachine).Synchronize
/workspace/internal/controller/cephfscg/replicationgroupsource.go:108
github.com/backube/volsync/controllers/statemachine.doSynchronizingState
/go/pkg/mod/github.com/backube/volsync@v0.7.1/controllers/statemachine/machine.go:102
github.com/backube/volsync/controllers/statemachine.Run
/go/pkg/mod/github.com/backube/volsync@v0.7.1/controllers/statemachine/machine.go:70
github.com/ramendr/ramen/internal/controller.(*ReplicationGroupSourceReconciler).Reconcile
/workspace/internal/controller/replicationgroupsource_controller.go:116
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222
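Based on the error above ("the volume group snapshot is being deleted, need to wait"), the lingering VolumeGroupSnapshot on the primary cluster can be inspected for a deletion that never completes; the VGS name below is a placeholder, and the commands assume the group snapshot CRDs installed for the feature gate:
# On the primary cluster, in the affected workload namespace
$ oc get volumegroupsnapshot -n busybox-cg1
# Check whether deletion is pending on finalizers
$ oc get volumegroupsnapshot <vgs-name> -n busybox-cg1 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
# The Ramen CR driving VGS creation, per the stack trace above
$ oc get replicationgroupsource -n busybox-cg1 -o yaml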