Bug
Resolution: Unresolved
Critical
odf-4.17
None
Description of problem (please be as detailed as possible and provide log
snippets):
Version of all relevant components (if applicable):
OCP 4.17.0-0.nightly-2024-10-20-231827
ODF 4.17.0-126
ACM 2.12.0-DOWNSTREAM-2024-10-18-21-57-41
OpenShift Virtualization 4.17.1-19
Submariner 0.19 unreleased downstream image 846949
ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)
OADP 1.4.1
OpenShift GitOps 1.14.0
VolSync 0.10.1
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy an RBD-backed CNV workload on an RDR setup using discovered apps. Create a clone of the PVC.
2. Delete the workload.
3. Now deploy a workload in such a way that it consumes the cloned PVC.
4. DR protect this workload with a drpolicy where flattening is not enabled (see the VolumeReplicationClass sketch after these steps).
5. The VR is promoted to primary, and sync and backup initially look fine for the workload, but the RBD image never undergoes flattening.
6. After a while, sync stops progressing for this workload, and the root cause is hard to debug because proper error messages are missing from the VR/DRPC resources.
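For reference, whether flattening is enabled ultimately surfaces as a parameter on the VolumeReplicationClass that Ramen selects for the PVC. A minimal sketch of what a flatten-enabled class could look like, assuming the ceph-csi flattenMode parameter is the mechanism (the class name and secret parameters below are illustrative, not taken from this setup):

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass-flatten        # illustrative name
spec:
  provisioner: openshift-storage.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    schedulingInterval: 5m
    flattenMode: force        # assumption: without this, a cloned/snapshot-restored image keeps its parent and mirroring stalls
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: openshift-storage

In this scenario the drpolicy was created without flattening, so the class referenced by the VR below (rbd-volumereplicationclass-1625360775) would not carry such a parameter.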
Actual results:
VR-
oc describe vr -n busybox-workloads-100
Name:         root-disk
Namespace:    busybox-workloads-100
Labels:       ramendr.openshift.io/owner-name=busybox-100
              ramendr.openshift.io/owner-namespace-name=openshift-dr-ops
Annotations:  <none>
API Version:  replication.storage.openshift.io/v1alpha1
Kind:         VolumeReplication
Metadata:
  Creation Timestamp:  2024-10-27T18:04:31Z
  Finalizers:
    replication.storage.openshift.io
  Generation:        1
  Resource Version:  9855180
  UID:               c4ae8511-9fa1-4a53-8374-8b87288255d1
Spec:
  Auto Resync:  false
  Data Source:
    API Group:
    Kind:       PersistentVolumeClaim
    Name:       root-disk
  Replication Handle:
  Replication State:         primary
  Volume Replication Class:  rbd-volumereplicationclass-1625360775
Status:
  Conditions:
    Last Transition Time:  2024-10-27T18:04:35Z
    Message:
    Observed Generation:   1
    Reason:                Promoted
    Status:                True
    Type:                  Completed
    Last Transition Time:  2024-10-27T18:04:35Z
    Message:
    Observed Generation:   1
    Reason:                Healthy
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2024-10-27T18:04:35Z
    Message:
    Observed Generation:   1
    Reason:                NotResyncing
    Status:                False
    Type:                  Resyncing
  Last Completion Time:  2024-10-27T18:47:06Z
  Last Sync Duration:    0s
  Last Sync Time:        2024-10-27T18:45:00Z
  Message:               volume is marked primary
  Observed Generation:   1
  State:                 Primary
Events:                  <none>
DRPC-
oc get drpc busybox-100 -oyaml -n openshift-dr-ops
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/app-namespace: openshift-dr-ops
    drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-21o-1
  creationTimestamp: "2024-10-27T18:04:31Z"
  finalizers:
  - drpc.ramendr.openshift.io/finalizer
  generation: 2
  labels:
    cluster.open-cluster-management.io/backup: ramen
  name: busybox-100
  namespace: openshift-dr-ops
  ownerReferences:
  - apiVersion: cluster.open-cluster-management.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Placement
    name: busybox-100-placement-1
    uid: e36cc23e-b6ad-4e24-ab76-0b8f2332aa9e
  resourceVersion: "8573969"
  uid: 552aaddd-3376-4550-ba3d-b7150e27ac91
spec:
  drPolicyRef:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPolicy
    name: odr-policy-5m
  kubeObjectProtection:
    captureInterval: 5m0s
    kubeObjectSelector:
      matchExpressions:
      - key: appname
        operator: In
        values:
        - vm
  placementRef:
    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: Placement
    name: busybox-100-placement-1
    namespace: openshift-dr-ops
  preferredCluster: amagrawa-21o-1
  protectedNamespaces:
  - busybox-workloads-100
  pvcSelector:
    matchExpressions:
    - key: appname
      operator: In
      values:
      - vm
status:
  actionDuration: 15.045573062s
  actionStartTime: "2024-10-27T18:04:46Z"
  conditions:
  - lastTransitionTime: "2024-10-27T18:04:31Z"
    message: Initial deployment completed
    observedGeneration: 2
    reason: Deployed
    status: "True"
    type: Available
  - lastTransitionTime: "2024-10-27T18:04:31Z"
    message: Ready
    observedGeneration: 2
    reason: Success
    status: "True"
    type: PeerReady
  - lastTransitionTime: "2024-10-27T18:07:31Z"
    message: VolumeReplicationGroup (openshift-dr-ops/busybox-100) on cluster amagrawa-21o-1
      is protecting required resources and data
    observedGeneration: 2
    reason: Protected
    status: "True"
    type: Protected
  lastGroupSyncDuration: 0s
  lastGroupSyncTime: "2024-10-27T18:10:00Z"
  lastKubeObjectProtectionTime: "2024-10-27T18:54:38Z"
  lastUpdateTime: "2024-10-27T18:59:33Z"
  observedGeneration: 2
  phase: Deployed
  preferredDecision:
    clusterName: amagrawa-21o-1
    clusterNamespace: amagrawa-21o-1
  progression: Completed
  resourceConditions:
    conditions:
    - lastTransitionTime: "2024-10-27T18:04:35Z"
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2024-10-27T18:04:32Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2024-10-27T18:04:31Z"
      message: Nothing to restore
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2024-10-27T18:04:39Z"
      message: Cluster data of all PVs are protected
      observedGeneration: 1
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox-100
      namespace: openshift-dr-ops
      protectedpvcs:
      - root-disk
      resourceVersion: "9869528"
Although the VolumeSynchronizationDelay alert does fire on the hub cluster if cluster monitoring labelling has been done, that labelling is optional and the alert does not point at the root cause; there could be any number of reasons why sync isn't progressing.
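For context, that alert only works if monitoring is enabled for the namespace where the hub DR operator runs; the labelling step referred to above is roughly the following (assuming the default openshift-operators namespace used by the ODF Multicluster Orchestrator):

oc label namespace openshift-operators openshift.io/cluster-monitoring='true'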
Also, to check whether the image underwent flattening or not, one has to rsh into the toolbox pod and run the ceph progress command, which isn't recommended for customers.
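For illustration, the manual check currently looks something like this (run from the rook-ceph toolbox on the managed cluster; pool and image names are placeholders):

oc rsh -n openshift-storage deploy/rook-ceph-tools
ceph progress                                             # an in-progress flatten shows up here as a progress event
rbd info ocs-storagecluster-cephblockpool/<image-name>    # a clone that was never flattened still shows a "parent:" line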
Expected results:
[RDR] [Flatten] Proper error messages should be shown in the VR and DRPC resources when a drpolicy without flattening is applied to a cloned/snapshot-restored PVC and sync doesn't resume / the RBD image doesn't undergo flattening.
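Purely as an illustration of the kind of signal being asked for (the reason and message below are hypothetical, not an existing API):

Status:
  Conditions:
    Message:  image has a parent and flattening is not enabled on the VolumeReplicationClass; replication cannot progress
    Reason:   FlattenRequired        # hypothetical reason
    Status:   True
    Type:     Degraded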
Additional info: