Data Foundation Bugs / DFBUGS-147

[2270740] [RDR] [Node failure] [CephFS] CSI CreateVolume is taking forever when creating ROX PVCs


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: odf-4.18
    • Affects Version/s: odf-4.15
    • Component/s: csi-driver
    • Sprint: Ceph-CSI Sprint 2024.4, Ceph-CSI Team sprint 2024.5, Ceph-CSI Team sprint 2024.6, Ceph-CSI Team sprint 2024.7, Ceph-CSI Team sprint 2024.8, Ceph-CSI Team sprint 2024.9

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):
      OCP 4.15.0-0.nightly-2024-03-12-010512
      ACM 2.10.0-DOWNSTREAM-2024-03-14-14-53-38
      ODF 4.15.0-158
      Submariner brew.registry.redhat.io/rh-osbs/iib:684361
      VolSync 0.8.1
      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Deploy multiple CephFS workloads (heavy and light; 4 in total in this case, with no RBD workloads running on this particular setup) on C1, using both AppSet (push method) and Subscription types.
      2. Run IOs for a few hours. If data sync is progressing well, relocate all of them to C2 and reboot one of the worker nodes of C2 (the preferred cluster) during the relocate operation: power it off for 2-3 minutes and bring it back online (see the command sketch after these steps).
      3. Check the relocate status and ensure data sync resumes for all the relocated workloads. Wait for 4-6 hours and let IOs continue.
      4. Repeat steps 2 and 3 a couple of times with a wait time of 4-6 hours between relocate operations, and ensure each relocate operation completes successfully and data sync resumes for all the workloads.
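      The relocate trigger and status checks in steps 2-4 can be approximated with plain oc commands against the hub. This is only a rough sketch: the DRPC name/namespace is taken from the output further down, the assumption that relocation is driven by patching spec.action on the DRPC (rather than from the ACM console) is mine, and the node power-off itself is done at the infrastructure/hypervisor level, not via oc.

      # Relocate one workload to the target cluster (<target-cluster> is a placeholder)
      $ oc patch drpc cephfs-sub-busybox1-placement-1-drpc -n busybox-workloads-1 \
          --type merge -p '{"spec":{"preferredCluster":"<target-cluster>","action":"Relocate"}}'

      # Watch DRPC state/progression while the worker node is powered off and brought back
      $ oc get drpc -A -o wide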

      During this test case execution, Submariner issue https://issues.redhat.com/browse/ACM-10508 was hit, which was recovered from by applying the workaround (ODF tracker BZ2270064).

      However, it was found that when the setup was left idle, data sync stopped/was lagging behind by hours for all 4 CephFS workloads deployed on this setup.

      Actual results: [RDR] [Node failure] [CephFS] CSI CreateVolume is taking forever when creating ROX PVCs

      On the hub:

      amagrawa:acm$ drpc
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      busybox-workloads-1 cephfs-sub-busybox1-placement-1-drpc 3d8h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:38Z 8m14.054566433s True
      busybox-workloads-2 cephfs-sub-busybox2-placement-1-drpc 3d8h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:44Z 8m4.968031416s True
      openshift-gitops cephfs-appset-busybox3-placement-drpc 3d7h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:29Z 8m20.265612015s True
      openshift-gitops cephfs-appset-busybox4-placement-drpc 3d7h amagrawa-mc1 amagrawa-mc2 Relocate Relocated Completed 2024-03-19T17:04:33Z 8m16.564337649s True

      amagrawa:acm$ group
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
      namespace: busybox-workloads-1
      namespace: busybox-workloads-1
      lastGroupSyncTime: "2024-03-21T09:43:18Z"
      namespace: busybox-workloads-1
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
      namespace: busybox-workloads-2
      namespace: busybox-workloads-2
      lastGroupSyncTime: "2024-03-19T17:01:28Z"
      namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-21T09:33:26Z"
      namespace: busybox-workloads-3
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-21T09:45:14Z"
      namespace: busybox-workloads-4

      amagrawa:acm$ date -u
      Thursday 21 March 2024 03:49:07 PM UTC
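      The drpc and group prompts above appear to be local shell aliases. Assuming lastGroupSyncTime is surfaced in the DRPC status (as the output above suggests), the same data can be pulled directly on the hub with something like:

      $ oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTGROUPSYNCTIME:.status.lastGroupSyncTime

      Compared with the date -u output above, busybox-workloads-2 has not synced since 2024-03-19T17:01:28Z, i.e. it is lagging by roughly two days, while the other three workloads are lagging by roughly six hours.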

      Requesting @bmekhiss@redhat.com to add further details to the bug to make it more coherent, as he has been debugging it thoroughly.

      Expected results: Even after multiple node failure operations during relocate, data sync should successfully resume for all the CephFS workloads once the relocate completes.

      Additional info:
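      One way the stuck CreateVolume symptom could be confirmed on the affected managed cluster is to look at the CephFS CSI provisioner. This is a sketch only; the deployment and container names assume a default ODF install in the openshift-storage namespace:

      # ROX PVCs created for CephFS replication should sit in Pending while CreateVolume hangs
      $ oc get pvc -A | grep Pending

      # CephFS CSI provisioner logs, filtered for CreateVolume / provisioning activity
      $ oc -n openshift-storage logs deploy/csi-cephfsplugin-provisioner -c csi-cephfsplugin --tail=500 | grep -i createvolume
      $ oc -n openshift-storage logs deploy/csi-cephfsplugin-provisioner -c csi-provisioner --tail=500 | grep -i provision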

              Assignee: Rakshith R (rar@redhat.com)
              Reporter: Aman Agrawal (amagrawa@redhat.com)
              QA Contact: Krishnaram Karthick Ramdoss