Data Foundation Bugs / DFBUGS-712

[2271819] [RDR] MDS blocklist while running IOs impacts data sync for most of the CephFS workloads


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Version(s): odf-4.18, odf-4.15
    • Component: odf-dr/ramen

      Description of problem (please be as detailed as possible and provide log snippets):

      Version of all relevant components (if applicable):
      ACM 2.10 GA'ed
      ODF 4.15 GA'ed
      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
      OCP 4.15.0-0.nightly-2024-03-24-023440
      VolSync 0.9.0
      Submariner 0.17 (GA'ed alongside ACM 2.10)
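
      For reference, these component versions can be gathered with standard commands (a minimal sketch; the openshift-storage namespace and the rook-ceph toolbox deployment are assumptions based on a default ODF deployment):

      # OCP version
      oc get clusterversion
      # ODF and related operator versions (CSVs in the storage namespace)
      oc get csv -n openshift-storage
      # Ceph version, via the rook-ceph toolbox
      oc -n openshift-storage rsh deploy/rook-ceph-tools ceph version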

      Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an RDR setup, deploy 6 DR-protected RBD and 6 DR-protected CephFS workloads on C1, covering both Subscription and ApplicationSet types (1 each). With all clusters up and running, fail over and relocate the workloads such that they end up running on C1 again and each of the states Deployed, FailedOver and Relocated is represented (see the drpc output below). For example, if busybox-1 is failed over to C2, it is then failed over back to C1, and so on.

      We also have 4 workloads (2 RBD and 2 CephFS) on C2, which remain in the Deployed state.

      2. After the second operation, when the workloads are finally running on C1, let IOs continue and verify that data sync is progressing well (a rough sketch of a command for checking this follows these steps).
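
      A minimal sketch of checking sync progress (assuming the oc context points at the hub cluster) is to watch lastGroupSyncTime in each DRPC status:

      oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTSYNC:.status.lastGroupSyncTime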

      Actual results:

      Current drpc looks like this:

      amanagrawal@Amans-MacBook-Pro acm % drpc
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      busybox-workloads-10 rbd-sub-busybox10-placement-1-drpc 24h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:43:37Z 11m0.015388203s True
      busybox-workloads-11 rbd-sub-busybox11-placement-1-drpc 24h amagrawa-c1-25m Deployed Completed 2024-03-26T12:33:38Z 21.041655484s True
      busybox-workloads-12 rbd-sub-busybox12-placement-1-drpc 24h amagrawa-c2-25m Deployed Completed 2024-03-26T12:34:37Z 23.056227975s True
      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 24h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:44:02Z 1m58.347505121s True
      busybox-workloads-14 cephfs-sub-busybox14-placement-1-drpc 23h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:44:12Z 3m12.33386173s True
      busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 23h amagrawa-c1-25m Deployed Completed 2024-03-26T12:44:01Z 46.116922842s True
      busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 23h amagrawa-c2-25m Deployed Completed 2024-03-26T12:44:46Z 51.147447868s True
      busybox-workloads-9 rbd-sub-busybox9-placement-1-drpc 24h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:43:32Z 4m3.186408297s True
      openshift-gitops cephfs-appset-busybox5-placement-drpc 25h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:43:47Z 2m47.983665745s True
      openshift-gitops cephfs-appset-busybox6-placement-drpc 25h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:43:55Z 3m20.35606035s True
      openshift-gitops cephfs-appset-busybox7-placement-drpc 25h amagrawa-c1-25m Deployed Completed 2024-03-26T10:52:36Z 31.14920314s True
      openshift-gitops cephfs-appset-busybox8-placement-drpc 25h amagrawa-c2-25m Deployed Completed 2024-03-26T10:53:37Z 31.18268721s True
      openshift-gitops rbd-appset-busybox1-placement-drpc 26h amagrawa-c2-25m amagrawa-c1-25m Failover FailedOver Completed 2024-03-26T16:43:23Z 5m42.208383384s True
      openshift-gitops rbd-appset-busybox2-placement-drpc 26h amagrawa-c1-25m amagrawa-c2-25m Relocate Relocated Completed 2024-03-26T16:43:27Z 3m33.157546816s True
      openshift-gitops rbd-appset-busybox3-placement-drpc 26h amagrawa-c1-25m Deployed Completed 2024-03-26T10:31:04Z 1.043044061s True
      openshift-gitops rbd-appset-busybox4-placement-drpc 26h amagrawa-c2-25m Deployed Completed 2024-03-26T22:36:39Z 21.046105696s True
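
      Note: drpc above is a local shell alias on the hub cluster; a rough equivalent (an assumption, not taken from the reporter's environment) is:

      oc get drpc -A -o wide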

      We see that the last sync for the workloads below happened more than 12 hours ago, even though the maximum sync interval is 15 minutes (the smallest is 5 minutes). There were 2 other CephFS workloads where sync was stuck but recovered on its own.

      amanagrawal@Amans-MacBook-Pro acm % group|grep 2024-03-26 -A1
      lastGroupSyncTime: "2024-03-26T22:31:05Z"
      namespace: busybox-workloads-13

      lastGroupSyncTime: "2024-03-26T22:31:08Z"
      namespace: busybox-workloads-14

      lastGroupSyncTime: "2024-03-26T22:31:03Z"
      namespace: busybox-workloads-15

      lastGroupSyncTime: "2024-03-26T22:31:09Z"
      namespace: busybox-workloads-5

      lastGroupSyncTime: "2024-03-26T22:31:17Z"
      namespace: busybox-workloads-7

      These workloads are running in namespaces busybox-workloads-5/7/13/14/15 (check above output).

      amanagrawal@Amans-MacBook-Pro acm % date -u
      Wed Mar 27 12:39:15 UTC 2024

      Comparing the current time in UTC with lastGroupSyncTime, the last sync happened more than 12 hours ago.

      Both C1 and C2 managed clusters have both mds-a and mds-b blocklisted.
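
      The pods and blocklist outputs below come from local shell aliases; rough equivalents (assumptions, run against each managed cluster) would be:

      # MDS pods with their pod IPs
      oc get pods -n openshift-storage -o wide | grep mds
      # Ceph OSD blocklist entries, via the rook-ceph toolbox
      oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd blocklist ls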

      C1-

      amanagrawal@Amans-MacBook-Pro c1 % pods|grep mds
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bcc778cxqhsj 2/2 Running 8 (158m ago) 26h 10.130.2.130 compute-2 <none> <none>
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-d9f7c6669vrgf 2/2 Running 8 (153m ago) 178m 10.131.0.163 compute-0 <none> <none>

      amanagrawal@Amans-MacBook-Pro c1 % blocklist|grep 10.130.2.130
      10.130.2.130:6800/3783322891 2024-03-27T22:34:11.076083+0000
      10.130.2.130:6801/1044670476 2024-03-28T10:01:41.112417+0000
      10.130.2.130:6801/4200338083 2024-03-28T09:52:41.080940+0000
      10.130.2.130:6801/3783322891 2024-03-27T22:34:11.076083+0000
      10.130.2.130:6801/510958469 2024-03-28T09:49:41.075844+0000
      10.130.2.130:6801/2938560102 2024-03-28T09:46:11.079514+0000
      10.130.2.130:6800/2938560102 2024-03-28T09:46:11.079514+0000
      10.130.2.130:6801/3641427817 2024-03-28T09:58:41.081184+0000
      10.130.2.130:6800/4200338083 2024-03-28T09:52:41.080940+0000
      10.130.2.130:6800/2752233664 2024-03-28T09:55:41.078792+0000
      10.130.2.130:6800/3641427817 2024-03-28T09:58:41.081184+0000
      10.130.2.130:6800/510958469 2024-03-28T09:49:41.075844+0000
      10.130.2.130:6801/2752233664 2024-03-28T09:55:41.078792+0000
      10.130.2.130:6800/1044670476 2024-03-28T10:01:41.112417+0000
      10.130.2.130:6801/1943230982 2024-03-28T10:04:41.076518+0000
      10.130.2.130:6800/1943230982 2024-03-28T10:04:41.076518+0000

      amanagrawal@Amans-MacBook-Pro c1 % blocklist|grep 10.131.0.163
      10.131.0.163:6801/505330432 2024-03-28T10:09:29.564093+0000
      10.131.0.163:6801/1543846151 2024-03-28T10:03:29.593834+0000
      10.131.0.163:6801/1862885432 2024-03-28T10:06:29.587355+0000
      10.131.0.163:6800/505330432 2024-03-28T10:09:29.564093+0000
      10.131.0.163:6800/3933062518 2024-03-28T09:54:29.567594+0000
      10.131.0.163:6800/1543846151 2024-03-28T10:03:29.593834+0000
      10.131.0.163:6800/3396601716 2024-03-28T09:47:59.572839+0000
      10.131.0.163:6800/2455676576 2024-03-28T09:57:29.567035+0000
      10.131.0.163:6801/3396601716 2024-03-28T09:47:59.572839+0000
      10.131.0.163:6800/1117252964 2024-03-28T09:51:29.584094+0000
      10.131.0.163:6801/1117252964 2024-03-28T09:51:29.584094+0000
      10.131.0.163:6801/3933062518 2024-03-28T09:54:29.567594+0000
      10.131.0.163:6800/1862885432 2024-03-28T10:06:29.587355+0000
      10.131.0.163:6801/2455676576 2024-03-28T09:57:29.567035+0000
      10.131.0.163:6800/2577235756 2024-03-28T10:00:29.577868+0000
      10.131.0.163:6801/2577235756 2024-03-28T10:00:29.577868+0000

      C2-

      amanagrawal@Amans-MacBook-Pro c2 % pods|grep mds
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-87cc58d5ksjth 2/2 Running 3 (14h ago) 26h 10.129.2.133 compute-1 <none> <none>
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-69c75b5b4lvq6 2/2 Running 3 (14h ago) 26h 10.131.0.47 compute-0 <none> <none>

      amanagrawal@Amans-MacBook-Pro c2 % blocklist|grep 10.129.2.133
      10.129.2.133:6800/1876809845 2024-03-27T22:40:14.610813+0000
      10.129.2.133:6801/1781001070 2024-03-27T22:37:14.613779+0000
      10.129.2.133:6800/1781001070 2024-03-27T22:37:14.613779+0000
      10.129.2.133:6801/1876809845 2024-03-27T22:40:14.610813+0000
      10.129.2.133:6801/3585694807 2024-03-27T22:34:14.622503+0000
      10.129.2.133:6800/3585694807 2024-03-27T22:34:14.622503+0000

      amanagrawal@Amans-MacBook-Pro c2 % blocklist|grep 10.131.0.47
      10.131.0.47:6801/2664558744 2024-03-27T22:40:12.393325+0000
      10.131.0.47:6800/2664558744 2024-03-27T22:40:12.393325+0000
      10.131.0.47:6801/1314658319 2024-03-27T22:37:12.394242+0000
      10.131.0.47:6800/1314658319 2024-03-27T22:37:12.394242+0000

      At least one of the dst (VolSync destination) pods in each of the above-mentioned namespaces has been stuck for more than 12 hours and is not able to recover.

      *No node-related operation was performed on this setup.*

      Expected results: MDS shouldn't be blocklisted and data sync for CephFS workloads should continue as expected.

      Additional info:

              pdonnell@redhat.com Patrick Donnelly
              amagrawa@redhat.com Aman Agrawal
              Benamar Mekhissi
              Krishnaram Karthick Ramdoss