Data Foundation Bugs / DFBUGS-712

[2271819] [RDR] MDS blocklist while running IOs impacts data sync for most of the CephFS workloads


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Version(s): odf-4.18, odf-4.15
    • Component: odf-dr/ramen

      Description of problem (please be as detailed as possible and provide log snippets):

      Version of all relevant components (if applicable):
      ACM 2.10 GA'ed
      ODF 4.15 GA'ed
      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
      OCP 4.15.0-0.nightly-2024-03-24-023440
      VolSync 0.9.0
      Submariner 0.17 (GA'ed alongside ACM 2.10)
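
      For reference, these component versions can be gathered with standard commands (a minimal sketch; the openshift-storage namespace and the rook-ceph toolbox deployment are assumptions based on a default ODF deployment):

      # OCP version
      oc get clusterversion
      # ODF and related operator versions (CSVs in the storage namespace)
      oc get csv -n openshift-storage
      # Ceph version, via the rook-ceph toolbox
      oc -n openshift-storage rsh deploy/rook-ceph-tools ceph version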

      Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an RDR setup, deploy 6 DR-protected RBD and 6 DR-protected CephFS workloads on C1, covering both Subscription and ApplicationSet types (1 each). With all clusters up and running, fail over and relocate the workloads such that they end up running on C1 again and each of the states Deployed, FailedOver and Relocated is represented (see the drpc output below). For example, if busybox-1 is failed over to C2, it is then failed over back to C1, and so on.

      We also have 4 workloads (2 RBD and 2 CephFS) on C2, which remain in the Deployed state.

      2. After the second operation, when the workloads are finally running on C1, let IOs continue and verify that data sync is progressing well (a rough sketch of a command for checking this follows these steps).
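
      A minimal sketch of checking sync progress (assuming the oc context points at the hub cluster) is to watch lastGroupSyncTime in each DRPC status:

      oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTSYNC:.status.lastGroupSyncTime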

      Actual results:

      Current drpc looks like this:

      amanagrawal@Amans-MacBook-Pro acm % drpc
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      busybox-workloads-10 rbd-sub-busybox10-placement-1-drpc 24h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:43:37Z 11m0.015388203s True
      busybox-workloads-11 rbd-sub-busybox11-placement-1-drpc 24h amagrawa-c1-25m Deployed Completed 2024-03-26T12:33:38Z 21.041655484s True
      busybox-workloads-12 rbd-sub-busybox12-placement-1-drpc 24h amagrawa-c2-25m Deployed Completed 2024-03-26T12:34:37Z 23.056227975s True
      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 24h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:44:02Z 1m58.347505121s True
      busybox-workloads-14 cephfs-sub-busybox14-placement-1-drpc 23h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:44:12Z 3m12.33386173s True
      busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 23h amagrawa-c1-25m Deployed Completed 2024-03-26T12:44:01Z 46.116922842s True
      busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 23h amagrawa-c2-25m Deployed Completed 2024-03-26T12:44:46Z 51.147447868s True
      busybox-workloads-9 rbd-sub-busybox9-placement-1-drpc 24h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:43:32Z 4m3.186408297s True
      openshift-gitops cephfs-appset-busybox5-placement-drpc 25h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:43:47Z 2m47.983665745s True
      openshift-gitops cephfs-appset-busybox6-placement-drpc 25h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:43:55Z 3m20.35606035s True
      openshift-gitops cephfs-appset-busybox7-placement-drpc 25h amagrawa-c1-25m Deployed Completed 2024-03-26T10:52:36Z 31.14920314s True
      openshift-gitops cephfs-appset-busybox8-placement-drpc 25h amagrawa-c2-25m Deployed Completed 2024-03-26T10:53:37Z 31.18268721s True
      openshift-gitops rbd-appset-busybox1-placement-drpc 26h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:43:23Z 5m42.208383384s True
      openshift-gitops rbd-appset-busybox2-placement-drpc 26h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:43:27Z 3m33.157546816s True
      openshift-gitops rbd-appset-busybox3-placement-drpc 26h amagrawa-c1-25m Deployed Completed 2024-03-26T10:31:04Z 1.043044061s True
      openshift-gitops rbd-appset-busybox4-placement-drpc 26h amagrawa-c2-25m Deployed Completed 2024-03-26T22:36:39Z 21.046105696s True
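
      Note: drpc above is a local shell alias on the hub cluster; a rough equivalent (an assumption, not taken from the reporter's environment) is:

      oc get drpc -A -o wide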

      We see that the last sync for the workloads below happened more than 12 hours ago, even though the maximum sync interval is 15 minutes (the smallest is 5 minutes). There were 2 other CephFS workloads where sync was stuck but recovered on its own.

      amanagrawal@Amans-MacBook-Pro acm % group|grep 2024-03-26 -A1
      lastGroupSyncTime: "2024-03-26T22:31:05Z"
      namespace: busybox-workloads-13

      lastGroupSyncTime: "2024-03-26T22:31:08Z"
      namespace: busybox-workloads-14

      lastGroupSyncTime: "2024-03-26T22:31:03Z"
      namespace: busybox-workloads-15

      lastGroupSyncTime: "2024-03-26T22:31:09Z"
      namespace: busybox-workloads-5

      lastGroupSyncTime: "2024-03-26T22:31:17Z"
      namespace: busybox-workloads-7

      These workloads are running in namespaces busybox-workloads-5/7/13/14/15 (check above output).

      amanagrawal@Amans-MacBook-Pro acm % date -u
      Wed Mar 27 12:39:15 UTC 2024

      Comparing the current time in UTC with lastGroupSyncTime, the last sync happened more than 12 hours ago.

      Both C1 and C2 managed clusters have both mds-a and mds-b blocklisted.
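
      The pods and blocklist outputs below come from local shell aliases; rough equivalents (assumptions, run against each managed cluster) would be:

      # MDS pods with their pod IPs
      oc get pods -n openshift-storage -o wide | grep mds
      # Ceph OSD blocklist entries, via the rook-ceph toolbox
      oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd blocklist ls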

      C1-

      amanagrawal@Amans-MacBook-Pro c1 % pods|grep mds
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bcc778cxqhsj 2/2 Running 8 (158m ago) 26h 10.130.2.130 compute-2 <none> <none>
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-d9f7c6669vrgf 2/2 Running 8 (153m ago) 178m 10.131.0.163 compute-0 <none> <none>

      amanagrawal@Amans-MacBook-Pro c1 % blocklist|grep 10.130.2.130
      10.130.2.130:6800/3783322891 2024-03-27T22:34:11.076083+0000
      10.130.2.130:6801/1044670476 2024-03-28T10:01:41.112417+0000
      10.130.2.130:6801/4200338083 2024-03-28T09:52:41.080940+0000
      10.130.2.130:6801/3783322891 2024-03-27T22:34:11.076083+0000
      10.130.2.130:6801/510958469 2024-03-28T09:49:41.075844+0000
      10.130.2.130:6801/2938560102 2024-03-28T09:46:11.079514+0000
      10.130.2.130:6800/2938560102 2024-03-28T09:46:11.079514+0000
      10.130.2.130:6801/3641427817 2024-03-28T09:58:41.081184+0000
      10.130.2.130:6800/4200338083 2024-03-28T09:52:41.080940+0000
      10.130.2.130:6800/2752233664 2024-03-28T09:55:41.078792+0000
      10.130.2.130:6800/3641427817 2024-03-28T09:58:41.081184+0000
      10.130.2.130:6800/510958469 2024-03-28T09:49:41.075844+0000
      10.130.2.130:6801/2752233664 2024-03-28T09:55:41.078792+0000
      10.130.2.130:6800/1044670476 2024-03-28T10:01:41.112417+0000
      10.130.2.130:6801/1943230982 2024-03-28T10:04:41.076518+0000
      10.130.2.130:6800/1943230982 2024-03-28T10:04:41.076518+0000

      amanagrawal@Amans-MacBook-Pro c1 % blocklist|grep 10.131.0.163
      10.131.0.163:6801/505330432 2024-03-28T10:09:29.564093+0000
      10.131.0.163:6801/1543846151 2024-03-28T10:03:29.593834+0000
      10.131.0.163:6801/1862885432 2024-03-28T10:06:29.587355+0000
      10.131.0.163:6800/505330432 2024-03-28T10:09:29.564093+0000
      10.131.0.163:6800/3933062518 2024-03-28T09:54:29.567594+0000
      10.131.0.163:6800/1543846151 2024-03-28T10:03:29.593834+0000
      10.131.0.163:6800/3396601716 2024-03-28T09:47:59.572839+0000
      10.131.0.163:6800/2455676576 2024-03-28T09:57:29.567035+0000
      10.131.0.163:6801/3396601716 2024-03-28T09:47:59.572839+0000
      10.131.0.163:6800/1117252964 2024-03-28T09:51:29.584094+0000
      10.131.0.163:6801/1117252964 2024-03-28T09:51:29.584094+0000
      10.131.0.163:6801/3933062518 2024-03-28T09:54:29.567594+0000
      10.131.0.163:6800/1862885432 2024-03-28T10:06:29.587355+0000
      10.131.0.163:6801/2455676576 2024-03-28T09:57:29.567035+0000
      10.131.0.163:6800/2577235756 2024-03-28T10:00:29.577868+0000
      10.131.0.163:6801/2577235756 2024-03-28T10:00:29.577868+0000

      C2-

      amanagrawal@Amans-MacBook-Pro c2 % pods|grep mds
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-87cc58d5ksjth 2/2 Running 3 (14h ago) 26h 10.129.2.133 compute-1 <none> <none>
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69c75b5b4lvq6 2/2 Running 3 (14h ago) 26h 10.131.0.47 compute-0 <none> <none>

      amanagrawal@Amans-MacBook-Pro c2 % blocklist|grep 10.129.2.133
      10.129.2.133:6800/1876809845 2024-03-27T22:40:14.610813+0000
      10.129.2.133:6801/1781001070 2024-03-27T22:37:14.613779+0000
      10.129.2.133:6800/1781001070 2024-03-27T22:37:14.613779+0000
      10.129.2.133:6801/1876809845 2024-03-27T22:40:14.610813+0000
      10.129.2.133:6801/3585694807 2024-03-27T22:34:14.622503+0000
      10.129.2.133:6800/3585694807 2024-03-27T22:34:14.622503+0000

      amanagrawal@Amans-MacBook-Pro c2 % blocklist|grep 10.131.0.47
      10.131.0.47:6801/2664558744 2024-03-27T22:40:12.393325+0000
      10.131.0.47:6800/2664558744 2024-03-27T22:40:12.393325+0000
      10.131.0.47:6801/1314658319 2024-03-27T22:37:12.394242+0000
      10.131.0.47:6800/1314658319 2024-03-27T22:37:12.394242+0000

      At least one of the dst (VolSync destination) pods in each of the above-mentioned namespaces has been stuck for more than 12 hours and is not able to recover.

      *No node-related operation was performed on this setup.*

      Expected results: MDS shouldn't be blocklisted and data sync for CephFS workloads should continue as expected.

      Additional info:

              pdonnell@redhat.com Patrick Donnelly
              amagrawa@redhat.com Aman Agrawal
              Benamar Mekhissi
              Krishnaram Karthick Ramdoss