Data Foundation Bugs / DFBUGS-526

[2250806] [RDR] [Hub recovery] Some of the dst pods remain stuck for different cephfs workloads due to which data sync cannot progress


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: odf-4.14
    • Component/s: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets): While testing hub recovery on a Regional DR setup, we hit BZ 2250152, but we were able to recover using the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2250152#c6.

      However, it was then found that data sync had stopped for 4 out of 6 cephfs workloads, as some of the dst pods for each of these workloads remained stuck and couldn't progress; a quick way to confirm this is sketched below.
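
      (A sketch for spotting the stuck dst pods across all namespaces; this is not the exact command used during triage, and the volsync-rsync-tls-dst name prefix is taken from the pod listings further down:)

        # Sketch: list VolSync dst pods that are not fully Ready, in any namespace.
        # The name prefix matches the listings below; adjust as needed.
        oc get pods -A | grep volsync-rsync-tls-dst | grep -v '1/1'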

      passive hub-

      amagrawa:~$ drpc|grep cephfs
      busybox-workloads-12 cephfs-sub-busybox-workloads-12-placement-1-drpc 4d18h amagrawa-10n-1 amagrawa-10n-1 Failover FailedOver Completed True
      busybox-workloads-13 cephfs-sub-busybox-workloads-13-placement-1-drpc 4d18h amagrawa-10n-1 Relocate Relocated Completed True
      busybox-workloads-14 cephfs-sub-busybox-workloads-14-placement-1-drpc 4d18h amagrawa-10n-1 amagrawa-10n-2 Failover FailedOver Completed True
      openshift-gitops cephfs-appset-busybox-workloads-10-placement-drpc 4d18h amagrawa-10n-1 amagrawa-10n-1 Failover FailedOver Completed True
      openshift-gitops cephfs-appset-busybox-workloads-11-placement-drpc 4d18h amagrawa-10n-1 Relocate Relocated Completed True
      openshift-gitops cephfs-appset-busybox-workloads-9-placement-drpc 4d18h amagrawa-10n-2 Relocate Relocated Completed True
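
      (Note: `drpc` at the prompt is a local shell alias, not an oc subcommand. Its definition isn't captured in this report; judging by the columns, it is presumably something like:)

        # Assumed alias; the exact flags are not shown in this report.
        alias drpc='oc get drpc -A'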

      amagrawa:~$ group
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
      namespace: busybox-workloads-12
      namespace: busybox-workloads-12
      lastGroupSyncTime: "2023-11-18T14:43:43Z"
      namespace: busybox-workloads-12
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
      namespace: busybox-workloads-13
      namespace: busybox-workloads-13
      lastGroupSyncTime: "2023-11-18T14:37:46Z"
      namespace: busybox-workloads-13
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
      namespace: busybox-workloads-14
      namespace: busybox-workloads-14
      lastGroupSyncTime: "2023-11-21T07:28:18Z"
      namespace: busybox-workloads-14
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
      namespace: busybox-workloads-6
      namespace: busybox-workloads-6
      lastGroupSyncTime: "2023-11-21T08:10:01Z"
      namespace: busybox-workloads-6
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
      namespace: busybox-workloads-7
      namespace: busybox-workloads-7
      lastGroupSyncTime: "2023-11-21T08:10:39Z"
      namespace: busybox-workloads-7
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
      namespace: busybox-workloads-8
      namespace: busybox-workloads-8
      lastGroupSyncTime: "2023-11-21T08:10:18Z"
      namespace: busybox-workloads-8
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-18T14:34:06Z"
      namespace: busybox-workloads-10
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-18T14:37:44Z"
      namespace: busybox-workloads-11
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-21T07:27:44Z"
      namespace: busybox-workloads-9
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-21T08:10:21Z"
      namespace: busybox-workloads-1
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-21T08:10:01Z"
      namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-21T08:10:18Z"
      namespace: busybox-workloads-3
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
      namespace: openshift-gitops
      namespace: openshift-gitops
      lastGroupSyncTime: "2023-11-21T08:10:01Z"
      namespace: busybox-workloads-4
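
      (Likewise, `group` is a local alias whose definition isn't shown. Based on the output, it appears to grep the DRPC YAML for the app-namespace annotation, namespace fields, and lastGroupSyncTime; presumably something like:)

        # Assumed alias; reconstructed from the fields visible in the output above.
        alias group='oc get drpc -A -o yaml | grep -E "app-namespace|namespace:|lastGroupSyncTime"'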

      amagrawa:~$ date -u
      Tuesday 21 November 2023 08:14:38 AM UTC

      Data sync is working for busybox-workloads-9 and 14 (they are behind by just a few hours, but progressing); however, it has completely stopped for the remaining 4 workloads. Note that the stuck dst pods below are all ~2d17h old, which lines up with the 2023-11-18 lastGroupSyncTime values above.

      From C2-

      amagrawa:c2$ oc get pods -n busybox-workloads-10
      NAME READY STATUS RESTARTS AGE
      volsync-rsync-tls-dst-busybox-pvc-1-tw4tq 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-10-qx29b 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-11-hqwzs 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-11-kwbww 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-12-hqvw5 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-13-rxfdj 1/1 Running 1 7m31s
      volsync-rsync-tls-dst-busybox-pvc-14-9lj7b 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-15-hxndg 1/1 Running 0 15m
      volsync-rsync-tls-dst-busybox-pvc-16-6gnws 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-17-f579l 1/1 Running 0 7m43s
      volsync-rsync-tls-dst-busybox-pvc-18-drxsg 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-19-8vg78 1/1 Running 1 7m19s
      volsync-rsync-tls-dst-busybox-pvc-2-bv9jc 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-2-fv94l 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-20-q9828 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-3-jrdzf 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-4-pk6ns 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-5-8nxqg 1/1 Running 2 10m
      volsync-rsync-tls-dst-busybox-pvc-6-298fv 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-7-gg8k4 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-7-v9p9w 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-8-x98m7 1/1 Running 1 10m
      volsync-rsync-tls-dst-busybox-pvc-9-wlc99 1/1 Running 0 11m

      amagrawa:c2$ oc get pods -n busybox-workloads-11
      NAME READY STATUS RESTARTS AGE
      volsync-rsync-tls-dst-busybox-pvc-1-clcqg 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-1-jgqfl 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-10-tv6x5 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-11-7gxvh 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-11-9bqzz 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-12-9cghj 1/1 Running 1 7m42s
      volsync-rsync-tls-dst-busybox-pvc-13-vkvd8 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-14-8764l 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-14-vnf6q 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-15-n5pds 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-16-d4v7d 1/1 Running 1 7m45s
      volsync-rsync-tls-dst-busybox-pvc-17-8zqs2 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-17-cv7dv 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-18-vvwkw 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-19-jq9ck 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-19-qx92z 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-2-xhtjw 1/1 Running 0 9m57s
      volsync-rsync-tls-dst-busybox-pvc-20-rwjk9 0/1 ContainerCreating 0 6m24s
      volsync-rsync-tls-dst-busybox-pvc-3-g4tw9 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-3-srwbs 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-4-q8z9w 1/1 Running 0 6m46s
      volsync-rsync-tls-dst-busybox-pvc-5-ddpwg 1/1 Running 1 7m42s
      volsync-rsync-tls-dst-busybox-pvc-6-f62xf 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-7-fn8mn 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-7-sp9gr 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-8-mkl9f 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-9-rdk98 1/1 Running 1 7m43s

      amagrawa:c2$ oc get pods -n busybox-workloads-12
      NAME READY STATUS RESTARTS AGE
      volsync-rsync-tls-dst-busybox-pvc-1-fpx8k 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-10-p2kvq 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-11-pb7bb 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-12-q5hgm 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-13-wj5bk 1/1 Running 0 8m
      volsync-rsync-tls-dst-busybox-pvc-14-jln2s 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-15-l4jsr 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-16-zzlld 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-17-zrlg7 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-18-k98dw 1/1 Running 1 7m35s
      volsync-rsync-tls-dst-busybox-pvc-19-flq4k 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-2-rbkvb 1/1 Running 1 7m46s
      volsync-rsync-tls-dst-busybox-pvc-20-hnsr7 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-3-q8hgc 1/1 Running 0 7m50s
      volsync-rsync-tls-dst-busybox-pvc-4-fqzsp 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-5-8qtc8 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-6-qmplt 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-7-6vnsm 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-8-8hpzn 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-8-qnzpn 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-9-hl8jn 1/1 Running 0 10m

      amagrawa:c2$ oc get pods -n busybox-workloads-13
      NAME READY STATUS RESTARTS AGE
      volsync-rsync-tls-dst-busybox-pvc-1-bsxj9 1/1 Running 0 7m43s
      volsync-rsync-tls-dst-busybox-pvc-10-9k5sl 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-10-fl5qn 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-11-nl4h8 1/1 Running 1 7m46s
      volsync-rsync-tls-dst-busybox-pvc-12-mdkbv 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-12-td9bh 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-13-tnrrw 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-14-jgdkw 1/1 Running 1 7m56s
      volsync-rsync-tls-dst-busybox-pvc-15-vt4bv 1/1 Running 0 8m13s
      volsync-rsync-tls-dst-busybox-pvc-16-thvlj 1/1 Running 2 15m
      volsync-rsync-tls-dst-busybox-pvc-17-njmdw 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-17-nqfz5 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-18-b8xvx 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-19-jmwwv 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-2-dblf2 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-20-wbd77 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-3-tfs4j 1/1 Running 0 10m
      volsync-rsync-tls-dst-busybox-pvc-4-pr4dc 1/1 Running 2 15m
      volsync-rsync-tls-dst-busybox-pvc-5-65mb9 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-5-gpxll 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-6-h9dg9 1/1 Running 2 11m
      volsync-rsync-tls-dst-busybox-pvc-7-nmwhr 1/1 Running 0 11m
      volsync-rsync-tls-dst-busybox-pvc-8-nw77f 0/1 ContainerStatusUnknown 1 2d17h
      volsync-rsync-tls-dst-busybox-pvc-8-qqrl9 0/1 ContainerCreating 0 2d17h
      volsync-rsync-tls-dst-busybox-pvc-9-j9dnk 1/1 Running 0 6m56s
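
      Each affected PVC shows a pair of wedged dst pods, one stuck in ContainerCreating and one in ContainerStatusUnknown, both ~2d17h old. To dig into why, describing one of them should surface the volume attach/mount events (a sketch; the pod name is taken from the busybox-workloads-11 listing above):

        # Sketch: show recent events for one of the stuck dst pods.
        oc describe pod -n busybox-workloads-11 volsync-rsync-tls-dst-busybox-pvc-1-clcqg | tail -n 20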

      We see the MDS crashing, along with MDS blocklist entries, on cluster C1. Notably, both backtraces below end in MDLog::_replay_thread, i.e. the MDS is crashing during journal replay.

      From C1-

      amagrawa:~$ crash
      ID ENTITY NEW
      2023-11-16T09:27:56.005749Z_0e6bf806-c030-4015-a480-8472717d78df mds.ocs-storagecluster-cephfilesystem-a
      2023-11-16T09:31:46.585416Z_1ebe91e6-a32f-461c-af98-39e5632683bf mds.ocs-storagecluster-cephfilesystem-a
      2023-11-18T06:48:44.968132Z_d3235ee3-ef54-4f77-97a8-62c57e0e3266 mds.ocs-storagecluster-cephfilesystem-a *
      2023-11-18T15:44:29.157802Z_f4529bb6-3011-4015-9f17-67950e5fd469 mds.ocs-storagecluster-cephfilesystem-a *
      2023-11-18T17:21:09.328424Z_b44b669f-a3b3-47b3-8852-0d0f50f801b0 mds.ocs-storagecluster-cephfilesystem-a *
      2023-11-19T01:55:40.796669Z_69708b4a-b65e-4069-9d67-f620c81e39bf mds.ocs-storagecluster-cephfilesystem-a *
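
      (`crash` is again a local alias. Given that the `ceph crash info` commands below run from a `bash-5.1$` toolbox prompt, it presumably wraps `ceph crash ls` via the rook-ceph toolbox; a sketch:)

        # Assumed wrapper; the toolbox deployment name is the usual ODF default,
        # not confirmed anywhere in this report.
        oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph crash ls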

      bash-5.1$ ceph crash info 2023-11-19T01:55:40.796669Z_69708b4a-b65e-4069-9d67-f620c81e39bf
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54df0) [0x7f83e3828df0]",
              "(std::_Rb_tree_decrement(std::_Rb_tree_node_base*)+0x23) [0x7f83e3b9b0d3]",
              "ceph-mds(+0x1078e5) [0x56481f8778e5]",
              "ceph-mds(+0x4ccea2) [0x56481fc3cea2]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x15c) [0x56481faed9fc]",
              "(MDLog::_replay_thread()+0x753) [0x56481fb38923]",
              "ceph-mds(+0x13ff41) [0x56481f8aff41]",
              "/lib64/libc.so.6(+0x9f802) [0x7f83e3873802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f83e3813450]"
          ],
          "ceph_version": "17.2.6-148.el9cp",
          "crash_id": "2023-11-19T01:55:40.796669Z_69708b4a-b65e-4069-9d67-f620c81e39bf",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.2 (Plow)",
          "os_version_id": "9.2",
          "process_name": "ceph-mds",
          "stack_sig": "fbb67541618c973da2228cd3e35dfa753d03a6704f9f669fd7d06921c34435bb",
          "timestamp": "2023-11-19T01:55:40.796669Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-9d76bb6dbddft",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
      }
      bash-5.1$ ceph crash info 2023-11-18T17:21:09.328424Z_b44b669f-a3b3-47b3-8852-0d0f50f801b0
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54df0) [0x7f7f01876df0]",
              "(std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)+0x134) [0x7f7f01be9644]",
              "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x3db8) [0x562ab72fc688]",
              "(EUpdate::replay(MDSRank*)+0x5d) [0x562ab730650d]",
              "(MDLog::_replay_thread()+0x753) [0x562ab72b1923]",
              "ceph-mds(+0x13ff41) [0x562ab7028f41]",
              "/lib64/libc.so.6(+0x9f802) [0x7f7f018c1802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f7f01861450]"
          ],
          "ceph_version": "17.2.6-148.el9cp",
          "crash_id": "2023-11-18T17:21:09.328424Z_b44b669f-a3b3-47b3-8852-0d0f50f801b0",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.2 (Plow)",
          "os_version_id": "9.2",
          "process_name": "ceph-mds",
          "stack_sig": "cab1dfc243115315f194002ddce635b37fa0765a0ee7d109f957b984029f3ade",
          "timestamp": "2023-11-18T17:21:09.328424Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-9d76bb6dbddft",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
      }

      amagrawa:~$ blocklist
      10.128.4.62:6801/2448646630 2023-11-22T03:32:45.603326+0000
      10.128.4.62:6800/2448646630 2023-11-22T03:32:45.603326+0000
      10.131.2.29:6801/947461227 2023-11-21T21:07:06.564298+0000
      10.131.2.29:6800/947461227 2023-11-21T21:07:06.564298+0000
      10.128.4.62:6801/4214049925 2023-11-21T16:54:07.267916+0000
      10.128.4.62:6800/4214049925 2023-11-21T16:54:07.267916+0000
      listed 6 entries
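
      (`blocklist` is presumably a similar wrapper around `ceph osd blocklist ls`; a sketch under the same toolbox assumption as above:)

        # Assumed wrapper, mirroring the crash listing above.
        oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd blocklist ls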

      This issue is being discussed with CephFS engineering here: https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1700497464642019

      and also with the CSI team, in case it's relevant to them and to better understand the actual root cause:
      https://chat.google.com/room/AAAAqWkMm2s/5SOd8QUuafM

      Version of all relevant components (if applicable):
      OCP 4.14.0-0.nightly-2023-11-09-204811
      Volsync 0.8.0
      Submariner 0.16.2
      ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2
      odf-multicluster-orchestrator.v4.14.1-rhodf
      ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Refer to the "Description of problem" above.

      Actual results: Some of the dst pods remain stuck for different cephfs workloads, due to which data sync cannot progress.

      Expected results: Dst pods shouldn't remain stuck, and data sync should be able to progress for all cephfs workloads.

      Additional info:

              Assignee: Venky Shankar (vshankar@redhat.com)
              Reporter: Aman Agrawal (amagrawa@redhat.com)
              QA Contact: Elad Ben Aharon