
Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: odf-4.15
Component/s: odf-dr/ramen
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Bugzilla Bug: RHBZ: 2279260
Release Note Type: Release Note Not Required
Intelligence Requested:
Market:
Regression: None
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem (please be as detailed as possible and provide log
snippets):

Version of all relevant components (if applicable):
ACM 2.10.2 GA'ed
MCE 2.5.2
ODF 4.15.2-1 GA'ed
ceph version 17.2.6-209.el9cp (e9529323dd7ab3b0e8cdf84e17a1b58c2b42948c) quincy (stable)
OCP 4.15.0-0.nightly-2024-04-30-234425
Submariner 0.17.1 GA'ed
VolSync 0.9.1

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Active hub co-situated with primary managed cluster

1. With multiple workloads (RBD and CephFS) of both subscription and appset (pull model) types in the Deployed state on the primary managed cluster (C1), let C1 go down along with the active hub cluster in a site failure at site-1. Then perform hub recovery and move to the passive hub at site-2 (which is co-situated with the secondary managed cluster C2).
2. Ensure the available managed cluster C2 is successfully imported on the RHACM console of the passive hub and that the DRPolicy gets validated.
3. After the DRPC is restored, recover the down managed cluster C1 and ensure it is successfully imported on the RHACM console.
4. Let IOs continue for some time (30 min to 1 hr) and ensure data sync is progressing well.
5. With both managed clusters up and running, fail over some of the workloads and relocate the remaining ones to the C2 managed cluster during the eviction period timeout (currently set to 24 hrs).

Actual results: [RDR] [Hub recovery] [Co-situated] Relocate operation and cleanup after failover remain stuck during the eviction period timeout

Hub-

oc get drpc -o wide -A
NAMESPACE               NAME                                      AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                   START TIME             DURATION   PEER READY
busybox-workloads-101   rbd-sub-busybox101-placement-1-drpc       14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     EnsuringVolumesAreSecondary   2024-05-05T17:32:04Z              False
busybox-workloads-103   cephfs-sub-busybox103-placement-1-drpc    14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     RunningFinalSync              2024-05-05T17:31:53Z              True
openshift-gitops        cephfs-appset-busybox102-placement-drpc   14h   amagrawa-c2-13apr   amagrawa-c1-13apr   Relocate       Relocating     RunningFinalSync              2024-05-05T17:31:45Z              True
openshift-gitops        rbd-appset-busybox100-placement-drpc      14h   amagrawa-c1-13apr   amagrawa-c2-13apr   Failover       FailedOver     Cleaning Up                   2024-05-05T17:31:57Z              False

Failover for rbd-appset-busybox100-placement-drpc worked, but its cleanup is stuck. Relocate of all the other workloads is also stuck.
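
The stuck state can also be picked out of the DRPC listing programmatically. The following is a minimal sketch, not part of the original report: it assumes the rows of `oc get drpc -o wide -A` have already been copied into dictionaries, the key names and the helper `stuck_drpcs` are hypothetical, and treating `Completed` as the only settled progression value is an assumption rather than confirmed Ramen behavior.

```python
# Rows copied from the `oc get drpc -o wide -A` output above.
# The dictionary keys are illustrative names, not the Ramen API schema.
DRPCS = [
    {"name": "rbd-sub-busybox101-placement-1-drpc", "desired": "Relocate",
     "current": "Relocating", "progression": "EnsuringVolumesAreSecondary"},
    {"name": "cephfs-sub-busybox103-placement-1-drpc", "desired": "Relocate",
     "current": "Relocating", "progression": "RunningFinalSync"},
    {"name": "cephfs-appset-busybox102-placement-drpc", "desired": "Relocate",
     "current": "Relocating", "progression": "RunningFinalSync"},
    {"name": "rbd-appset-busybox100-placement-drpc", "desired": "Failover",
     "current": "FailedOver", "progression": "Cleaning Up"},
]

# Assumption: any progression other than "Completed" means the action is unfinished.
SETTLED = "Completed"


def stuck_drpcs(rows):
    """Return (name, desired action, progression) for every unfinished DRPC."""
    return [(r["name"], r["desired"], r["progression"])
            for r in rows if r["progression"] != SETTLED]


for name, action, progression in stuck_drpcs(DRPCS):
    print(f"{name}: {action} waiting at {progression}")
```

With the rows above, all four DRPCs are reported, which matches the observation that neither the relocations nor the failover cleanup finished.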

Failed over/relocated from C1 to C2-

C2-

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-100
NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41 14h rbd-volumereplicationclass-473128587 busybox-pvc-41 primary Primary

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41 Bound pvc-7c5e424d-b75a-495d-8745-4d3220fc48e6 42Gi RWO ocs-storagecluster-ceph-rbd 14h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox100-placement-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-41-5c55b45d49-qh7v8 1/1 Running 0 13h 10.129.2.51 compute-2 <none> <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-101
No resources found in busybox-workloads-101 namespace.

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-102
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-542ab575-db38-4187-bbb0-70697ea232f3 94Gi RWX ocs-storagecluster-cephfs 3d16h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox102-placement-drpc secondary Secondary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-mw285 1/1 Running 0 2m41s 10.129.2.156 compute-2 <none> <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-103
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-3256f1e5-79e3-43ff-96cb-e0b727ffcc74 94Gi RWX ocs-storagecluster-cephfs 3d16h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox103-placement-1-drpc secondary Secondary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-2q7zp 1/1 Running 0 2m46s 10.129.2.155 compute-2 <none> <none>

C1-

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-100
NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41 3d16h rbd-volumereplicationclass-473128587 busybox-pvc-41 primary Primary

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41 Bound pvc-7c5e424d-b75a-495d-8745-4d3220fc48e6 42Gi RWO ocs-storagecluster-ceph-rbd 3d16h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-appset-busybox100-placement-drpc secondary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-41-5c55b45d49-fngg2 1/1 Running 2 3d16h 10.128.3.196 compute-0 <none> <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-101
NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc-41 3d16h rbd-volumereplicationclass-473128587 busybox-pvc-41 primary Primary

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-41 Bound pvc-ea52541b-acb4-4ecb-afd6-a00925bf3583 42Gi RWO ocs-storagecluster-ceph-rbd 3d16h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sub-busybox101-placement-1-drpc secondary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-41-5c55b45d49-59tbp 1/1 Running 2 3d16h 10.128.3.198 compute-0 <none> <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-102
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-46f34593-ba2f-435c-801d-66b7371fd359 94Gi RWX ocs-storagecluster-cephfs 3d16h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox102-placement-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-1-7f9b67dc95-wq4tr 1/1 Running 2 3d16h 10.128.3.208 compute-0 <none> <none>

oc get vr,pvc,vrg,pods -o wide -n busybox-workloads-103
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1 Bound pvc-40f27ae0-e3f5-4e74-822e-0eab289f3232 94Gi RWX ocs-storagecluster-cephfs 3d16h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-sub-busybox103-placement-1-drpc primary Primary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/busybox-1-7f9b67dc95-v6h4s 1/1 Running 2 3d16h 10.128.3.209 compute-0 <none> <none>
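
In the C1 listings above, the two RBD VRGs report a desired state of secondary while their current state is still Primary. A minimal sketch for flagging such mismatches follows. The tuples are copied by hand from the C1 output, and the helper name `unreconciled` is hypothetical.

```python
# (namespace, VRG name, desired state, current state) as listed for C1 above.
C1_VRGS = [
    ("busybox-workloads-100", "rbd-appset-busybox100-placement-drpc", "secondary", "Primary"),
    ("busybox-workloads-101", "rbd-sub-busybox101-placement-1-drpc", "secondary", "Primary"),
    ("busybox-workloads-102", "cephfs-appset-busybox102-placement-drpc", "primary", "Primary"),
    ("busybox-workloads-103", "cephfs-sub-busybox103-placement-1-drpc", "primary", "Primary"),
]


def unreconciled(vrgs):
    """Return (namespace, name) of VRGs whose current state differs from the desired one.

    The comparison ignores case because the listing prints the desired state in
    lowercase and the current state capitalized.
    """
    return [(ns, name) for ns, name, desired, current in vrgs
            if desired.lower() != current.lower()]


for ns, name in unreconciled(C1_VRGS):
    print(f"{ns}/{name}: desired and current state differ")
```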

Logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/05may24/

Expected results: After hub recovery, the admin should be able to fail over or relocate the workloads successfully, regardless of the eviction period timeout.

Additional info:
