Data Foundation Bugs / DFBUGS-484

[2272495] [RDR] CephFS snapshots: VolumeSnapshot->ReadyToUse is false and stays false forever due to timeout error


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.15
    • Component: ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):
      ACM 2.10 GA'ed
      ODF 4.15 GA'ed
      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
      OCP 4.15.0-0.nightly-2024-03-24-023440
      VolSync 0.9.0
      Submariner 0.17 (GA'ed alongside ACM 2.10)

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?
      No, but a CLI workaround is available.

      Run ""oc get volumesnapshot -A | grep false"" where the workloads are primary and delete the stale volumesnapshots which are stuck (older ones only). If they are not deleted, remove the finalizers for the VolumeSnapshot and their corresponding VolumeSnapshotContent.

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Deployed 6 RBD and 6 CephFS DR-protected workloads on C1 over an RDR setup, using both Subscription and AppSet types (1 each), then failed them over and relocated them (with all clusters up and running) so that they are finally running on C1 and each maintains a unique state such as Deployed, FailedOver or Relocated (check the drpc output below). For example, if busybox-1 was failed over to C2, it was failed over back to C1, and so on.

      We also have 4 workloads (2 RBD and 2 CephFS) on C2 and they remain as it is in the Deployed state.

      2. After the 2nd operation, when the workloads are finally running on C1, let IOs continue overnight and ensure data sync is progressing well (see the sketch after these steps for one way to verify sync progress).
      3.
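      One way to verify that data sync is progressing (a sketch; it assumes the Ramen DRPC status exposes lastGroupSyncTime and that the CephFS workloads replicate via VolSync ReplicationSources, both of which should keep advancing between checks):

      # last successful group sync per DRPC on the hub
      oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTSYNC:.status.lastGroupSyncTime

      # last successful VolSync sync per CephFS PVC on the primary managed cluster
      oc get replicationsource -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTSYNC:.status.lastSyncTime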

      Actual results:
      Current drpc looks like this:

      amanagrawal@Amans-MacBook-Pro ~ % drpc
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      busybox-workloads-10 rbd-sub-busybox10-placement-1-drpc 2d7h amagrawa-c1-29m amagrawa-c2-29m Relocate Relocated Completed 2024-03-30T11:46:04Z 3m34.194136561s True
      busybox-workloads-11 rbd-sub-busybox11-placement-1-drpc 2d7h amagrawa-c1-29m Deployed Completed 2024-03-30T08:50:53Z 17.08813336s True
      busybox-workloads-12 rbd-sub-busybox12-placement-1-drpc 2d7h amagrawa-c1-29m Deployed Completed 2024-03-30T08:53:01Z 15.048723358s True
      busybox-workloads-13 rbd-sub-busybox13-placement-1-drpc 2d7h amagrawa-c2-29m Deployed Completed 2024-03-30T08:54:07Z 1.052097882s True
      busybox-workloads-14 cephfs-sub-busybox14-placement-1-drpc 2d7h amagrawa-c2-29m amagrawa-c1-29m Failover FailedOver Completed 2024-03-30T11:45:33Z 2m13.87941597s True
      busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 2d7h amagrawa-c1-29m amagrawa-c2-29m Relocate Relocated Completed 2024-03-30T11:45:41Z 2m25.765604202s True
      busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 2d7h amagrawa-c1-29m Deployed Completed 2024-03-30T08:58:36Z 45.206438694s True
      busybox-workloads-17 cephfs-sub-busybox17-placement-1-drpc 2d7h amagrawa-c2-29m Deployed Completed 2024-03-30T09:00:45Z 45.250150435s True
      busybox-workloads-9 rbd-sub-busybox9-placement-1-drpc 2d7h amagrawa-c2-29m amagrawa-c1-29m Failover FailedOver Completed 2024-03-30T11:45:51Z 3m52.912910746s True
      openshift-gitops cephfs-appset-busybox5-placement-drpc 2d7h amagrawa-c2-29m amagrawa-c1-29m Failover FailedOver Completed 2024-03-30T11:45:22Z 1m54.720876193s True
      openshift-gitops cephfs-appset-busybox6-placement-drpc 2d7h amagrawa-c1-29m amagrawa-c2-29m Relocate Relocated Completed 2024-03-30T11:45:27Z 4m59.996397972s True
      openshift-gitops cephfs-appset-busybox7-placement-drpc 2d7h amagrawa-c1-29m Deployed Completed 2024-03-30T08:33:50Z 45.322308361s True
      openshift-gitops cephfs-appset-busybox8-placement-drpc 2d7h amagrawa-c2-29m Deployed Completed 2024-03-30T08:34:58Z 51.174493729s True
      openshift-gitops rbd-appset-busybox1-placement-drpc 2d7h amagrawa-c2-29m amagrawa-c1-29m Failover FailedOver Completed 2024-03-30T11:46:12Z 4m35.129560857s True
      openshift-gitops rbd-appset-busybox2-placement-drpc 2d7h amagrawa-c1-29m amagrawa-c2-29m Relocate Relocated Completed 2024-03-30T11:46:15Z 10m42.195053331s True
      openshift-gitops rbd-appset-busybox3-placement-drpc 2d7h amagrawa-c1-29m Deployed Completed 2024-03-30T08:29:02Z 19.120449331s True
      openshift-gitops rbd-appset-busybox4-placement-drpc 2d7h amagrawa-c2-29m Deployed Completed 2024-03-30T08:30:30Z 1.049141938s True

      Data sync wasn't progressing for busybox-workloads-6, 7 and 14

      busybox-workloads-6 then seemed to recover, and busybox-workloads-8 seemed to be impacted instead.

      amanagrawal@Amans-MacBook-Pro 01april24 % oc get volumesnapshot -A | grep false
      busybox-workloads-14 volsync-busybox-pvc-1-src false busybox-pvc-1 ocs-storagecluster-cephfsplugin-snapclass snapcontent-a42538b4-eeef-45c3-94cf-3ef9bdf4feae 28h
      busybox-workloads-7 volsync-busybox-pvc-4-src false busybox-pvc-4 ocs-storagecluster-cephfsplugin-snapclass snapcontent-ac363a4b-2b11-443a-b968-2a705381959d 28h
      busybox-workloads-8 busybox-pvc-2-20240401030130 false busybox-pvc-2 ocs-storagecluster-cephfsplugin-snapclass snapcontent-e9d04116-9cd3-4021-b11d-a2066a46eb0a 10h
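      To see the timeout error behind READYTOUSE=false, the stuck snapshot and its bound content can be inspected (a sketch, using one of the snapshots listed above; the openshift-storage namespace and provisioner deployment name are the ODF defaults and may differ on other setups):

      oc describe volumesnapshot volsync-busybox-pvc-4-src -n busybox-workloads-7
      oc describe volumesnapshotcontent snapcontent-ac363a4b-2b11-443a-b968-2a705381959d

      # the csi-snapshotter sidecar on the primary cluster usually logs the underlying CreateSnapshot timeout
      oc logs -n openshift-storage deploy/csi-cephfsplugin-provisioner -c csi-snapshotter | grep -i "deadline"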

      Expected results: VolumeSnapshot->ReadyToUse should be True for CephFS workloads, and data sync should progress as expected while IOs are running.

      Additional info:

              Radoslaw Zarzynski (rzarzyns@redhat.com)
              Aman Agrawal (amagrawa@redhat.com)
              Elad Ben Aharon