Type: Bug
Resolution: Unresolved
Priority: Critical
Affects version: odf-4.16
Fix version: None
Description of problem (please be as detailed as possible and provide log snippets):
Version of all relevant components (if applicable):
ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ACM 2.10.2
MCE 2.5.2
Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
****Active hub co-situated with the primary managed cluster****
1. On an RDR setup with both RBD and CephFS workloads of Subscription and AppSet (pull model) types in distinct states (Deployed, FailedOver, and Relocated), perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
2. Then fail over all the workloads that were running on the down managed cluster to the surviving managed cluster.
3. After successful failover, recover the down managed cluster.
During cleanup, both VRG states are marked as Secondary for the CephFS workloads on the recovered managed cluster, which eventually marks PeerReady as True in the DRPC resource on the hub, but the ReplicationDestination is not created on the recovered cluster until the eviction period times out (currently 24 hours).
4. Now fail over the CephFS workloads back to the recovered cluster, for which PeerReady is marked as True even though the ReplicationDestination has not been created (see the example commands after these steps).
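The state described in steps 3 and 4 can be checked with commands along these lines (a sketch using the namespace from this report; resource names will differ per setup):
# On the passive hub: PEER READY column of the DRPC
oc get drpc -n busybox-workloads-15 -o wide
# On the recovered managed cluster: VRG desired/current state during cleanup
oc get vrg -n busybox-workloads-15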
Actual results: Since PeerReady is marked as True for the CephFS workloads in this case, the UI allows failover even though the first sync has not completed due to the missing ReplicationDestination.
Marking PeerReady as True is expected when both VRG states are marked as Secondary on the recovered cluster (refer to https://bugzilla.redhat.com/show_bug.cgi?id=2263488#c21), but failover never completes when the ReplicationDestination is missing.
The idea is to allow the failover using the last restored PVC state back to the recovered cluster.
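For reference, the PeerReady condition that gates this failover can be read directly from the DRPC on the hub; a minimal sketch using the DRPC name and namespace from this report:
# Prints the PeerReady status and message from the hub DRPC
oc get drpc cephfs-sub-busybox15-placement-1-drpc -n busybox-workloads-15 \
  -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}{" "}{.status.conditions[?(@.type=="PeerReady")].message}{"\n"}'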
New Hub-
busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 7d9h amagrawa-c2-29a amagrawa-c1-29a Failover FailedOver WaitForReadiness 2024-05-17T07:41:59Z False
oc get drpc -o yaml -n busybox-workloads-15
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
      drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-c2-29a
    creationTimestamp: "2024-05-16T07:51:31Z"
    finalizers:
    - drpc.ramendr.openshift.io/finalizer
    generation: 3
    labels:
      cluster.open-cluster-management.io/backup: ramen
      velero.io/backup-name: acm-resources-generic-schedule-20240516070015
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20240516070015
    name: cephfs-sub-busybox15-placement-1-drpc
    namespace: busybox-workloads-15
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Placement
      name: cephfs-sub-busybox15-placement-1
      uid: 31b90e55-e8e3-42b4-8f0a-ca8a71daa7ab
    resourceVersion: "36276430"
    uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
  spec:
    action: Failover
    drPolicyRef:
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPolicy
      name: my-drpolicy-5
    failoverCluster: amagrawa-c1-29a
    placementRef:
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      name: cephfs-sub-busybox15-placement-1
      namespace: busybox-workloads-15
    preferredCluster: amagrawa-c2-29a
    pvcSelector:
      matchLabels:
        appname: busybox_app3_cephfs
  status:
    actionStartTime: "2024-05-17T07:41:59Z"
    conditions:
    - lastTransitionTime: "2024-05-17T07:42:28Z"
      message: Completed
      observedGeneration: 3
      reason: FailedOver
      status: "True"
      type: Available
    - lastTransitionTime: "2024-05-17T07:41:59Z"
      message: Started failover to cluster "amagrawa-c1-29a"
      observedGeneration: 3
      reason: NotStarted
      status: "False"
      type: PeerReady
    lastUpdateTime: "2024-05-23T16:40:50Z"
    phase: FailedOver
    preferredDecision:
      clusterName: amagrawa-c1-29a
      clusterNamespace: amagrawa-c1-29a
    progression: WaitForReadiness
    resourceConditions:
      conditions:
      - lastTransitionTime: "2024-05-16T08:00:28Z"
        message: All VolSync PVCs are ready
        observedGeneration: 6
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2024-05-16T08:00:28Z"
        message: Not all VolSync PVCs are protected
        observedGeneration: 6
        reason: DataProtected
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2024-05-16T08:00:16Z"
        message: Nothing to restore
        observedGeneration: 6
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2024-05-16T08:00:28Z"
        message: Not all VolSync PVCs are protected
        observedGeneration: 6
        reason: DataProtected
        status: "False"
        type: ClusterDataProtected
      resourceMeta:
        generation: 6
        kind: VolumeReplicationGroup
        name: cephfs-sub-busybox15-placement-1-drpc
        namespace: busybox-workloads-15
        protectedpvcs:
        - busybox-pvc-1
kind: List
metadata:
  resourceVersion: ""
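The stuck progression and the VolSync protection condition from the DRPC status above can be pulled out in one line; a sketch assuming the same DRPC name and namespace:
# Prints the progression and the ClusterDataProtected message
# (for the DRPC above: WaitForReadiness / Not all VolSync PVCs are protected)
oc get drpc cephfs-sub-busybox15-placement-1-drpc -n busybox-workloads-15 \
  -o jsonpath='{.status.progression}{"\n"}{.status.resourceConditions.conditions[?(@.type=="ClusterDataProtected")].message}{"\n"}'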
Recovered cluster C1-
oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
Now using project "busybox-workloads-15" on server "https://api.amagrawa-c2-29a.qe.rh-ocs.com:6443".
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-a333d21e-3ab7-425d-8254-8fa62522dc3f 94Gi RWX ocs-storagecluster-cephfs <unset> 23d Filesystem
persistentvolumeclaim/volsync-busybox-pvc-1-src Bound pvc-06823313-ed2d-49df-9773-55ef9a56f114 94Gi ROX ocs-storagecluster-cephfs-vrg <unset> 7d9h Filesystem
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc primary Primary
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-1-7f9b67dc95-6hjn2 1/1 Running 0 7d9h 10.128.3.234 compute-2 <none> <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-676pm 0/1 Error 0 26m 10.128.2.70 compute-2 <none> <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-zxzl5 1/1 Running 0 4m7s 10.128.2.71 compute-2 <none> <none>
oc describe vrg
Name: cephfs-sub-busybox15-placement-1-drpc
Namespace: busybox-workloads-15
Labels: <none>
Annotations: drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c2-29a
drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc:
drplacementcontrol.ramendr.openshift.io/drpc-uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
API Version: ramendr.openshift.io/v1alpha1
Kind: VolumeReplicationGroup
Metadata:
Creation Timestamp: 2024-04-30T13:36:03Z
Finalizers:
volumereplicationgroups.ramendr.openshift.io/vrg-protection
Generation: 6
Owner References:
API Version: work.open-cluster-management.io/v1
Kind: AppliedManifestWork
Name: 661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-sub-busybox15-placement-1-drpc-busybox-workloads-15-vrg-mw
UID: 79905b6c-78f9-414c-abc9-a6506a5cf852
Resource Version: 47644387
UID: 61c8fe31-6d15-4b42-876e-2f5d9f8d55af
Spec:
Action: Failover
Async:
Replication Class Selector:
Scheduling Interval: 5m
Volume Snapshot Class Selector:
Pvc Selector:
Match Labels:
Appname: busybox_app3_cephfs
Replication State: primary
s3Profiles:
s3profile-amagrawa-c1-29a-ocs-storagecluster
s3profile-amagrawa-c2-29a-ocs-storagecluster
Vol Sync:
Status:
Conditions:
Last Transition Time: 2024-05-16T08:00:28Z
Message: All VolSync PVCs are ready
Observed Generation: 6
Reason: Ready
Status: True
Type: DataReady
Last Transition Time: 2024-05-16T08:00:28Z
Message: Not all VolSync PVCs are protected
Observed Generation: 6
Reason: DataProtected
Status: False
Type: DataProtected
Last Transition Time: 2024-05-16T08:00:16Z
Message: Nothing to restore
Observed Generation: 6
Reason: Restored
Status: True
Type: ClusterDataReady
Last Transition Time: 2024-05-16T08:00:28Z
Message: Not all VolSync PVCs are protected
Observed Generation: 6
Reason: DataProtected
Status: False
Type: ClusterDataProtected
Kube Object Protection:
Last Update Time: 2024-05-23T16:40:25Z
Observed Generation: 6
Protected PV Cs:
Access Modes:
ReadWriteMany
Annotations:
apps.open-cluster-management.io/hosting-subscription: busybox-workloads-15/cephfs-sub-busybox15-subscription-1
apps.open-cluster-management.io/reconcile-option: merge
Conditions:
Last Transition Time: 2024-05-16T08:00:16Z
Message: Ready
Observed Generation: 6
Reason: SourceInitialized
Status: True
Type: ReplicationSourceSetup
Last Transition Time: 2024-05-16T07:59:24Z
Message: PVC restored
Observed Generation: 5
Reason: Restored
Status: True
Type: PVsRestored
Labels:
App: cephfs-sub-busybox15
app.kubernetes.io/part-of: cephfs-sub-busybox15
Appname: busybox_app3_cephfs
apps.open-cluster-management.io/reconcile-rate: medium
velero.io/backup-name: acm-resources-schedule-20240516070016
velero.io/restore-name: restore-acm-acm-resources-schedule-20240516070016
Name: busybox-pvc-1
Namespace: busybox-workloads-15
Protected By Vol Sync: true
Replication ID:
Id:
Resources:
Requests:
Storage: 94Gi
Storage Class Name: ocs-storagecluster-cephfs
Storage ID:
Id:
State: Primary
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal PrimaryVRGProcessSuccess 62m (x42 over 3h21m) controller_VolumeReplicationGroup Primary Success
Normal PrimaryVRGProcessSuccess 20m (x5 over 62m) controller_VolumeReplicationGroup Primary Success
C1 still has a ReplicationSource for the failed-over workload, but no ReplicationDestination:
oc get replicationsources.volsync.backube -A
NAMESPACE NAME SOURCE LAST SYNC DURATION NEXT SYNC
busybox-workloads-15 busybox-pvc-1 busybox-pvc-1
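The missing ReplicationDestination on C1 can be confirmed directly against the VolSync CRD (a sketch; in this state the command is expected to report no resources in the namespace):
# ReplicationDestination objects created for protected CephFS PVCs
oc get replicationdestinations.volsync.backube -n busybox-workloads-15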
Surviving cluster C2-
oc project busybox-workloads-15; oc get pvc,vr,vrg,pods -o wide
Already on project "busybox-workloads-15" on server "https://api.amagrawa-c2-29a.qe.rh-ocs.com:6443".
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-a333d21e-3ab7-425d-8254-8fa62522dc3f 94Gi RWX ocs-storagecluster-cephfs <unset> 23d Filesystem
persistentvolumeclaim/volsync-busybox-pvc-1-src Bound pvc-06823313-ed2d-49df-9773-55ef9a56f114 94Gi ROX ocs-storagecluster-cephfs-vrg <unset> 7d9h Filesystem
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox15-placement-1-drpc primary Primary
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-1-7f9b67dc95-6hjn2 1/1 Running 0 7d9h 10.128.3.234 compute-2 <none> <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-676pm 0/1 Error 0 28m 10.128.2.70 compute-2 <none> <none>
pod/volsync-rsync-tls-src-busybox-pvc-1-zxzl5 1/1 Running 0 5m19s 10.128.2.71 compute-2 <none> <none>
oc describe vrg
Name: cephfs-sub-busybox15-placement-1-drpc
Namespace: busybox-workloads-15
Labels: <none>
Annotations: drplacementcontrol.ramendr.openshift.io/destination-cluster: amagrawa-c2-29a
drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc:
drplacementcontrol.ramendr.openshift.io/drpc-uid: e0bb1638-5fa6-4b45-8a5e-2bc688c38101
API Version: ramendr.openshift.io/v1alpha1
Kind: VolumeReplicationGroup
Metadata:
Creation Timestamp: 2024-04-30T13:36:03Z
Finalizers:
volumereplicationgroups.ramendr.openshift.io/vrg-protection
Generation: 6
Owner References:
API Version: work.open-cluster-management.io/v1
Kind: AppliedManifestWork
Name: 661184cbe6aabc283e2f4acb234afb291390b8b4b3dd10af342eca0c4e7e3f41-cephfs-sub-busybox15-placement-1-drpc-busybox-workloads-15-vrg-mw
UID: 79905b6c-78f9-414c-abc9-a6506a5cf852
Resource Version: 47644387
UID: 61c8fe31-6d15-4b42-876e-2f5d9f8d55af
Spec:
Action: Failover
Async:
Replication Class Selector:
Scheduling Interval: 5m
Volume Snapshot Class Selector:
Pvc Selector:
Match Labels:
Appname: busybox_app3_cephfs
Replication State: primary
s3Profiles:
s3profile-amagrawa-c1-29a-ocs-storagecluster
s3profile-amagrawa-c2-29a-ocs-storagecluster
Vol Sync:
Status:
Conditions:
Last Transition Time: 2024-05-16T08:00:28Z
Message: All VolSync PVCs are ready
Observed Generation: 6
Reason: Ready
Status: True
Type: DataReady
Last Transition Time: 2024-05-16T08:00:28Z
Message: Not all VolSync PVCs are protected
Observed Generation: 6
Reason: DataProtected
Status: False
Type: DataProtected
Last Transition Time: 2024-05-16T08:00:16Z
Message: Nothing to restore
Observed Generation: 6
Reason: Restored
Status: True
Type: ClusterDataReady
Last Transition Time: 2024-05-16T08:00:28Z
Message: Not all VolSync PVCs are protected
Observed Generation: 6
Reason: DataProtected
Status: False
Type: ClusterDataProtected
Kube Object Protection:
Last Update Time: 2024-05-23T16:40:25Z
Observed Generation: 6
Protected PV Cs:
Access Modes:
ReadWriteMany
Annotations:
apps.open-cluster-management.io/hosting-subscription: busybox-workloads-15/cephfs-sub-busybox15-subscription-1
apps.open-cluster-management.io/reconcile-option: merge
Conditions:
Last Transition Time: 2024-05-16T08:00:16Z
Message: Ready
Observed Generation: 6
Reason: SourceInitialized
Status: True
Type: ReplicationSourceSetup
Last Transition Time: 2024-05-16T07:59:24Z
Message: PVC restored
Observed Generation: 5
Reason: Restored
Status: True
Type: PVsRestored
Labels:
App: cephfs-sub-busybox15
app.kubernetes.io/part-of: cephfs-sub-busybox15
Appname: busybox_app3_cephfs
apps.open-cluster-management.io/reconcile-rate: medium
velero.io/backup-name: acm-resources-schedule-20240516070016
velero.io/restore-name: restore-acm-acm-resources-schedule-20240516070016
Name: busybox-pvc-1
Namespace: busybox-workloads-15
Protected By Vol Sync: true
Replication ID:
Id:
Resources:
Requests:
Storage: 94Gi
Storage Class Name: ocs-storagecluster-cephfs
Storage ID:
Id:
State: Primary
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal PrimaryVRGProcessSuccess 63m (x42 over 3h22m) controller_VolumeReplicationGroup Primary Success
Normal PrimaryVRGProcessSuccess 20m (x5 over 62m) controller_VolumeReplicationGroup Primary Success
C2 has a ReplicationSource too, but that is expected (as the failover is successful and the workload is running on this cluster):
oc get replicationsources.volsync.backube -A
NAMESPACE NAME SOURCE LAST SYNC DURATION NEXT SYNC
busybox-workloads-15 busybox-pvc-1 busybox-pvc-1
Expected results: Failover should complete using the last restored PVC state when the ReplicationDestination is missing.
Additional info: