Data Foundation Bugs / DFBUGS-542

[2271666] [RDR] [Hub recovery] [Co-situated] Slowness observed in the data replication even when the cluster isn't heavily loaded


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.15
    • Component: odf-dr/ramen

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):
      ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
      OCP 4.15.0-0.nightly-2024-03-05-113700
      ODF 4.15.0-157
      ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)
      Submariner brew.registry.redhat.io/rh-osbs/iib:680159
      VolSync 0.8.0

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      ****Active hub co-situated with primary managed cluster****

      1. After a site failure (the active hub and the primary managed cluster go down) and the move to the passive hub post hub recovery, all the workloads (RBD and CephFS) of both Subscription and AppSet types, in the states Deployed, FailedOver, and Relocated (1 each), that had been running on the primary managed cluster were failed over to the failover cluster (secondary), and the failover operation was successful.

      The workloads ran successfully on the failover cluster (secondary), and both VRG states were marked as Primary for all these workloads.

      We also had 4 workloads in the Deployed state on C2 (RBD and CephFS, AppSet and Subscription types, 1 each), and they remained as they were.

      2. Now recover the older primary managed cluster and ensure it is successfully imported into the RHACM console (if not, create an auto-import-secret for this cluster on the passive hub; see the sketch after this list).
      3. Monitor the drpc cleanup status and lastGroupSyncTime for all the failed-over workloads.
      4. Assuming cleanup happens successfully (though it doesn't; apply workarounds if needed, as we have bugs open for both RBD and CephFS), data sync should resume for all the workloads. Let IOs continue for a few days (3-4), during which sync should progress as expected without firing VolSync.DelayAlert (a critical-level alert that fires when the difference between lastGroupSyncTime for any workload (per namespace) and the current time in UTC exceeds 3x the sync interval set in the DRPolicy assigned to that workload; a rough check is sketched below).
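
      For reference, a minimal sketch of the auto-import-secret mentioned in step 2, assuming the kubeconfig form documented for RHACM; <cluster-name> (the managed cluster namespace on the hub) and the kubeconfig path are placeholders, not values taken from this environment:

      # Run against the passive hub.
      oc create secret generic auto-import-secret \
        -n <cluster-name> \
        --from-literal=autoImportRetry=5 \
        --from-file=kubeconfig=./recovered-cluster.kubeconfig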
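
      To spot workloads breaching the 3x threshold described in step 4, a rough check against the hub could look like the following. This is only a sketch, not a supported tool: it assumes GNU date and jq are available, that the DRPC status exposes lastGroupSyncTime (as seen in the outputs below), and that the DRPolicy scheduling interval is 5m; adjust the threshold to the interval actually in use.

      THRESHOLD=900   # 3 x 300s (5m) scheduling interval; adjust to the drpolicy in use
      now=$(date -u +%s)
      oc get drpc -A -o json | jq -r '.items[] |
        [.metadata.namespace, .metadata.name, .status.lastGroupSyncTime] | @tsv' |
      while IFS=$'\t' read -r ns name last; do
        [ -z "$last" ] && continue                       # skip workloads with no sync time yet
        lag=$(( now - $(date -u -d "$last" +%s) ))       # seconds since the last group sync
        [ "$lag" -gt "$THRESHOLD" ] && echo "LAGGING: $ns/$name lastGroupSyncTime=$last (${lag}s behind)"
      done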

      Actual results: Slowness in data replication was observed even though the cluster was not heavily loaded and the total workload count was within the desired limit.

      (Note: Later, after a week or so, it was found that osd-2 of cluster C1 was down and unable to recover; it is unclear whether this contributed to the slowness.)

      rook-ceph-osd-2-d6855dd9b-fhx62 0/2 Init:0/4 0 5d7h <none> compute-1 <none> <none>

      amanagrawal@Amans-MacBook-Pro hub % oc describe pod rook-ceph-osd-2-d6855dd9b-fhx62
      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning FailedAttachVolume 78s (x3146 over 4d10h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-864f7514-df95-4e4f-bdc9-e7a663a9070e" : volume attachment is being deleted

      Then bmekhiss@redhat.com followed this KB article to address the volume attachment issue with osd-2:
      https://kb.vmware.com/s/article/85213

      but it probably didn't help.
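
      For anyone triaging this, one way to inspect the attachment stuck in "being deleted" is sketched below; the PVC name comes from the event above, <attachment-name> is a placeholder, and clearing finalizers is a last-resort step, not the procedure that was followed here (the VMware KB above was).

      oc get volumeattachment | grep pvc-864f7514-df95-4e4f-bdc9-e7a663a9070e
      # Last resort only, if the attachment object is orphaned and blocking deletion:
      # oc patch volumeattachment <attachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'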

      After this, we collected the must-gather, which can be found here:
      http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/26mar24/
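
      For reference, an ODF must-gather of this kind is typically collected with something like the following (the image reference and destination directory are placeholders, not the exact command used here):

      oc adm must-gather --image=<odf-must-gather-image> --dest-dir=./must-gather-26mar24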

      Passive hub-

      amanagrawal@Amans-MacBook-Pro ~ % drpc
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      busybox-workloads-13 cephfs-sub-busybox13-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:02Z 45m29.999316502s True
      busybox-workloads-14 cephfs-sub-busybox14-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:08Z 45m24.522406539s True
      busybox-workloads-15 cephfs-sub-busybox15-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:15Z 45m17.399723614s True
      busybox-workloads-16 cephfs-sub-busybox16-placement-1-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:15Z 51m16.600450889s True
      busybox-workloads-5 rbd-sub-busybox5-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:01Z 1h49m3.070653116s True
      busybox-workloads-6 rbd-sub-busybox6-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:01Z 1h48m52.893419572s True
      busybox-workloads-7 rbd-sub-busybox7-placement-1-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:23:49Z 1h49m5.411490083s True
      busybox-workloads-8 rbd-sub-busybox8-placement-1-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:14Z 724.385424ms True
      openshift-gitops cephfs-appset-busybox10-placement-drpc 18d amagrawa-c1 amagrawa-c2 Relocate Relocated Completed 2024-03-26T17:29:48Z 29m16.062088831s True
      openshift-gitops cephfs-appset-busybox11-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:36Z 44m56.299638791s True
      openshift-gitops cephfs-appset-busybox12-placement-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:14Z 51m14.942082281s True
      openshift-gitops cephfs-appset-busybox9-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:25:31Z 44m58.19020063s True
      openshift-gitops rbd-appset-busybox1-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:26Z 3h25m19.310948353s True
      openshift-gitops rbd-appset-busybox2-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:25Z 436h13m52.709973551s True
      openshift-gitops rbd-appset-busybox3-placement-drpc 18d amagrawa-c1 amagrawa-c2 Failover FailedOver Completed 2024-03-08T11:24:13Z 436h13m39.761048611s True
      openshift-gitops rbd-appset-busybox4-placement-drpc 18d amagrawa-c2 Deployed Completed 2024-03-08T11:19:14Z 297.972857ms True

      amanagrawal@Amans-MacBook-Pro ~ % date -u
      Tue Mar 26 19:33:39 UTC 2024

      amanagrawal@Amans-MacBook-Pro ~ % group
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
      namespace: busybox-workloads-13
      namespace: busybox-workloads-13
      lastGroupSyncTime: "2024-03-26T15:51:26Z"
      namespace: busybox-workloads-13
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
      namespace: busybox-workloads-14
      namespace: busybox-workloads-14
      lastGroupSyncTime: "2024-03-26T17:35:43Z"
      namespace: busybox-workloads-14
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-15
      namespace: busybox-workloads-15
      namespace: busybox-workloads-15
      lastGroupSyncTime: "2024-03-26T17:35:24Z"
      namespace: busybox-workloads-15
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-16
      namespace: busybox-workloads-16
      namespace: busybox-workloads-16
      lastGroupSyncTime: "2024-03-26T17:36:42Z"
      namespace: busybox-workloads-16
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-5
      namespace: busybox-workloads-5
      namespace: busybox-workloads-5
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-5
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
      namespace: busybox-workloads-6
      namespace: busybox-workloads-6
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-6
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
      namespace: busybox-workloads-7
      namespace: busybox-workloads-7
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-7
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
      namespace: busybox-workloads-8
      namespace: busybox-workloads-8
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-8
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:11:40Z"
      namespace: busybox-workloads-10
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T17:37:57Z"
      namespace: busybox-workloads-11
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:27:18Z"
      namespace: busybox-workloads-12
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T16:27:29Z"
      namespace: busybox-workloads-9
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-1
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:02Z"
      namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-3
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2024-03-26T19:30:00Z"
      namespace: busybox-workloads-4

      Roughly 6-8 workloads have been firing the alert for the past week because their lastGroupSyncTime is delayed.
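
      For instance, busybox-workloads-13 above shows lastGroupSyncTime 2024-03-26T15:51:26Z against the 19:33:39 UTC timestamp, i.e. roughly 3h42m behind; assuming a scheduling interval on the order of minutes (e.g. 5m or 15m), that is far beyond the 3x threshold, so the alert fires.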

      C1-

      amanagrawal@Amans-MacBook-Pro c1 % cephdf
      — RAW STORAGE —
      CLASS SIZE AVAIL USED RAW USED %RAW USED
      ssd 1.5 TiB 1.4 TiB 133 GiB 133 GiB 8.64
      TOTAL 1.5 TiB 1.4 TiB 133 GiB 133 GiB 8.64

      — POOLS —
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      ocs-storagecluster-cephblockpool 1 64 255 GiB 118.47k 83 GiB 6.81 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.otp 2 8 0 B 0 0 B 0 381 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 3 8 0 B 0 0 B 0 381 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.index 4 8 124 KiB 11 248 KiB 0 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.meta 5 8 10 KiB 17 156 KiB 0 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.log 6 8 944 KiB 340 3.7 MiB 0 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.control 7 8 0 B 8 0 B 0 571 GiB
      .rgw.root 8 8 8.7 KiB 16 180 KiB 0 571 GiB
      ocs-storagecluster-cephfilesystem-metadata 9 32 3.5 GiB 1.98k 7.0 GiB 0.61 571 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.data 10 32 195 KiB 111 1.3 MiB 0 571 GiB
      .mgr 11 1 1.0 MiB 2 2.1 MiB 0 571 GiB
      ocs-storagecluster-cephfilesystem-data0 12 32 17 GiB 1.17M 34 GiB 2.90 571 GiB

      C2-

      amanagrawal@Amans-MacBook-Pro c2 % cephdf
      — RAW STORAGE —
      CLASS SIZE AVAIL USED RAW USED %RAW USED
      ssd 1.5 TiB 1.4 TiB 80 GiB 80 GiB 5.19
      TOTAL 1.5 TiB 1.4 TiB 80 GiB 80 GiB 5.19

      — POOLS —
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      ocs-storagecluster-cephblockpool 1 64 13 GiB 5.57k 38 GiB 3.03 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 2 8 0 B 0 0 B 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.otp 3 8 0 B 0 0 B 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.control 4 8 0 B 8 0 B 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.log 5 8 692 KiB 340 3.9 MiB 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.index 6 8 874 KiB 11 2.6 MiB 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.meta 7 8 4.2 KiB 17 148 KiB 0 408 GiB
      .rgw.root 8 8 5.8 KiB 16 180 KiB 0 408 GiB
      ocs-storagecluster-cephfilesystem-metadata 9 32 1.4 GiB 614 4.3 GiB 0.35 408 GiB
      .mgr 10 1 769 KiB 2 2.3 MiB 0 408 GiB
      ocs-storagecluster-cephobjectstore.rgw.buckets.data 11 32 1.8 GiB 1.61k 5.4 GiB 0.44 408 GiB
      ocs-storagecluster-cephfilesystem-data0 12 32 8.5 GiB 1.11M 26 GiB 2.06 408 GiB

      The utilization shown above is after running IOs for more than a week.

      Expected results: Slowness shouldn't be observed in the data replication when the cluster isn't heavily loaded.

      Additional info:

              pdonnell@redhat.com Patrick Donnelly
              amagrawa@redhat.com Aman Agrawal
        Krishnaram Karthick Ramdoss