Bug
Resolution: Cannot Reproduce
Critical
odf-4.15
None
False
False
?
?
If docs needed, set a value
Ceph-CSI Sprint 2024.4, Ceph-CSI Team sprint 2024.5, Ceph-CSI Team sprint 2024.6, Ceph-CSI Team sprint 2024.7, Ceph-CSI Team sprint 2024.8, Ceph-CSI Team sprint 2024.9
None
Description of problem (please be as detailed as possible and provide log snippets):
Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-03-12-010512
ACM 2.10.0-DOWNSTREAM-2024-03-14-14-53-38
ODF 4.15.0-158
Submariner brew.registry.redhat.io/rh-osbs/iib:684361
VolSync 0.8.1
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy multiple CephFS workloads (heavy and light; 4 in total in this case, with no RBD workloads running on this particular setup) on C1, of both appset (push method) and subscription types.
2. Run IOs for a few hours. If data sync is progressing well, relocate all of them to C2 and reboot one of the worker nodes of C2 (the preferred cluster) during the relocate operation (see the command sketch after these steps). Keep it powered off for 2-3 minutes and then bring it back online.
3. Check the relocate status and ensure data sync resumes for all the relocated workloads. Wait for 4-6 hrs and let IOs continue.
4. Repeat steps 2 and 3 a couple of times with a wait time of 4-6 hrs between relocate operations, and ensure each relocate operation completes successfully and data sync resumes for all the workloads.
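For reference, the relocate in step 2 is typically triggered by setting the action on each workload's DRPlacementControl from the hub; a minimal sketch, assuming oc access to the hub and using one of the DRPC names from the output below:
# Hedged sketch: trigger a relocate by patching spec.action on the DRPC (Ramen DRPlacementControl API).
oc patch drpc cephfs-sub-busybox1-placement-1-drpc -n busybox-workloads-1 \
  --type merge -p '{"spec":{"action":"Relocate"}}'
# Progress can then be followed via the CURRENTSTATE/PROGRESSION columns of the drpc listing below.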
During this test case execution, the Submariner issue https://issues.redhat.com/browse/ACM-10508 was hit; it was recovered from by applying the workaround (ODF tracker BZ2270064).
However, it was found that when the setup was left idle, data sync stopped/lagged behind by hours for all 4 CephFS workloads deployed on this setup.
Actual results: [RDR] [Node failure] [CephFS] CSI CreateVolume is taking forever when creating ROX PVCs
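A hedged way to confirm the stuck ROX provisioning on the managed cluster (commands are a sketch; the provisioner deployment and container names assume a default ODF install in openshift-storage):
# Look for ROX PVCs stuck in Pending and for long-running/repeated CreateVolume calls.
oc get pvc -A | grep -i pending
oc logs -n openshift-storage deploy/csi-cephfsplugin-provisioner -c csi-provisioner --tail=100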
Hub-
amagrawa:acm$ drpc
NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
busybox-workloads-1 cephfs-sub-busybox1-placement-1-drpc 3d8h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:38Z 8m14.054566433s True
busybox-workloads-2 cephfs-sub-busybox2-placement-1-drpc 3d8h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:44Z 8m4.968031416s True
openshift-gitops cephfs-appset-busybox3-placement-drpc 3d7h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:29Z 8m20.265612015s True
openshift-gitops cephfs-appset-busybox4-placement-drpc 3d7h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:33Z 8m16.564337649s True
amagrawa:acm$ group
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
namespace: busybox-workloads-1
namespace: busybox-workloads-1
lastGroupSyncTime: "2024-03-21T09:43:18Z"
namespace: busybox-workloads-1
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
namespace: busybox-workloads-2
namespace: busybox-workloads-2
lastGroupSyncTime: "2024-03-19T17:01:28Z"
namespace: busybox-workloads-2
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2024-03-21T09:33:26Z"
namespace: busybox-workloads-3
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2024-03-21T09:45:14Z"
namespace: busybox-workloads-4
amagrawa:acm$ date -u
Thursday 21 March 2024 03:49:07 PM UTC
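The `drpc` and `group` commands above appear to be local shell shortcuts; a hedged guess at roughly equivalent oc commands, and the lag the timestamps show:
# Hedged sketch (assumed equivalents for the shortcuts used above; not confirmed by the reporter):
oc get drpc -A -o wide
oc get drpc -A -o yaml | grep -E 'app-namespace|namespace:|lastGroupSyncTime'
# Comparing lastGroupSyncTime with `date -u` above: busybox-workloads-2 last synced at
# 2024-03-19T17:01:28Z, roughly 46 hours before 2024-03-21 15:49 UTC, while the other three
# workloads last synced about 6 hours earlier, so sync is clearly not keeping up.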
Requesting @bmekhiss@redhat.com to add further details to the bug to make it more coherent, as he has been debugging it thoroughly.
Expected results: Even after multiple node failure operations during relocate, data sync should successfully resume for all the CephFS workloads once the relocate completes successfully.
Additional info: