Bug
Resolution: Unresolved
Critical
None
odf-4.14
None
Description of problem (please be as detailed as possible and provide log snippets):
While testing hub recovery on a Regional DR setup, we hit BZ2250152, but we were able to recover using the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2250152#c6.
However, it was then found that data sync had stopped for 4 out of 6 cephfs workloads, as some of the dst pods for these workloads remained stuck and could not progress:
On the passive hub:
amagrawa:~$ drpc|grep cephfs
busybox-workloads-12 cephfs-sub-busybox-workloads-12-placement-1-drpc 4d18h amagrawa-10n-1 amagrawa-10n-1 Failover FailedOver Completed True
busybox-workloads-13 cephfs-sub-busybox-workloads-13-placement-1-drpc 4d18h amagrawa-10n-1 Relocate Relocated Completed True
busybox-workloads-14 cephfs-sub-busybox-workloads-14-placement-1-drpc 4d18h amagrawa-10n-1 amagrawa-10n-2 Failover FailedOver Completed True
openshift-gitops cephfs-appset-busybox-workloads-10-placement-drpc 4d18h amagrawa-10n-1 amagrawa-10n-1 Failover FailedOver Completed True
openshift-gitops cephfs-appset-busybox-workloads-11-placement-drpc 4d18h amagrawa-10n-1 Relocate Relocated Completed True
openshift-gitops cephfs-appset-busybox-workloads-9-placement-drpc 4d18h amagrawa-10n-2 Relocate Relocated Completed True
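(The drpc listing above comes from a local shell alias; it presumably expands to something like the command below, with the header row dropped by grep. The exact alias definition is an assumption, not taken from the report:)
$ oc get drpc -o wide --all-namespaces | grep cephfs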
amagrawa:~$ group
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-12
namespace: busybox-workloads-12
namespace: busybox-workloads-12
lastGroupSyncTime: "2023-11-18T14:43:43Z"
namespace: busybox-workloads-12
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-13
namespace: busybox-workloads-13
namespace: busybox-workloads-13
lastGroupSyncTime: "2023-11-18T14:37:46Z"
namespace: busybox-workloads-13
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-14
namespace: busybox-workloads-14
namespace: busybox-workloads-14
lastGroupSyncTime: "2023-11-21T07:28:18Z"
namespace: busybox-workloads-14
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-6
namespace: busybox-workloads-6
namespace: busybox-workloads-6
lastGroupSyncTime: "2023-11-21T08:10:01Z"
namespace: busybox-workloads-6
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-7
namespace: busybox-workloads-7
namespace: busybox-workloads-7
lastGroupSyncTime: "2023-11-21T08:10:39Z"
namespace: busybox-workloads-7
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-8
namespace: busybox-workloads-8
namespace: busybox-workloads-8
lastGroupSyncTime: "2023-11-21T08:10:18Z"
namespace: busybox-workloads-8
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-10
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-18T14:34:06Z"
namespace: busybox-workloads-10
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-11
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-18T14:37:44Z"
namespace: busybox-workloads-11
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-9
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-21T07:27:44Z"
namespace: busybox-workloads-9
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-21T08:10:21Z"
namespace: busybox-workloads-1
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-21T08:10:01Z"
namespace: busybox-workloads-2
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-21T08:10:18Z"
namespace: busybox-workloads-3
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-11-21T08:10:01Z"
namespace: busybox-workloads-4
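(Similarly, group is a local alias; judging by the fields it prints — the app-namespace annotation, namespace, and lastGroupSyncTime — it presumably greps them out of the DRPC objects on the hub, roughly like the sketch below. Again, this is an assumed expansion, not the actual alias:)
$ oc get drpc -A -o yaml | grep -E 'app-namespace|namespace:|lastGroupSyncTime:'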
amagrawa:~$ date -u
Tuesday 21 November 2023 08:14:38 AM UTC
Data sync is working for busybox-workloads-9 and 14 (they are behind by just a few hours, but progressing); however, it has completely stopped for the remaining 4 workloads.
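A rough sanity check (my arithmetic, not from the logs): for the stalled workloads lastGroupSyncTime is around 2023-11-18T14:3x UTC, and against the current time of 2023-11-21T08:14 UTC that is a lag of roughly 65 hours (~2d17h), which lines up with the 2d17h age of the stuck dst pods shown below:
$ start=2023-11-18T14:43:43Z; now=2023-11-21T08:14:38Z
$ echo $(( ( $(date -u -d "$now" +%s) - $(date -u -d "$start" +%s) ) / 3600 )) hours
65 hours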
From C2:
amagrawa:c2$ oc get pods -n busybox-workloads-10
NAME READY STATUS RESTARTS AGE
volsync-rsync-tls-dst-busybox-pvc-1-tw4tq 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-10-qx29b 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-11-hqwzs 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-11-kwbww 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-12-hqvw5 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-13-rxfdj 1/1 Running 1 7m31s
volsync-rsync-tls-dst-busybox-pvc-14-9lj7b 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-15-hxndg 1/1 Running 0 15m
volsync-rsync-tls-dst-busybox-pvc-16-6gnws 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-17-f579l 1/1 Running 0 7m43s
volsync-rsync-tls-dst-busybox-pvc-18-drxsg 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-19-8vg78 1/1 Running 1 7m19s
volsync-rsync-tls-dst-busybox-pvc-2-bv9jc 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-2-fv94l 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-20-q9828 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-3-jrdzf 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-4-pk6ns 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-5-8nxqg 1/1 Running 2 10m
volsync-rsync-tls-dst-busybox-pvc-6-298fv 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-7-gg8k4 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-7-v9p9w 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-8-x98m7 1/1 Running 1 10m
volsync-rsync-tls-dst-busybox-pvc-9-wlc99 1/1 Running 0 11m
amagrawa:c2$ oc get pods -n busybox-workloads-11
NAME READY STATUS RESTARTS AGE
volsync-rsync-tls-dst-busybox-pvc-1-clcqg 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-1-jgqfl 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-10-tv6x5 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-11-7gxvh 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-11-9bqzz 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-12-9cghj 1/1 Running 1 7m42s
volsync-rsync-tls-dst-busybox-pvc-13-vkvd8 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-14-8764l 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-14-vnf6q 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-15-n5pds 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-16-d4v7d 1/1 Running 1 7m45s
volsync-rsync-tls-dst-busybox-pvc-17-8zqs2 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-17-cv7dv 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-18-vvwkw 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-19-jq9ck 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-19-qx92z 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-2-xhtjw 1/1 Running 0 9m57s
volsync-rsync-tls-dst-busybox-pvc-20-rwjk9 0/1 ContainerCreating 0 6m24s
volsync-rsync-tls-dst-busybox-pvc-3-g4tw9 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-3-srwbs 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-4-q8z9w 1/1 Running 0 6m46s
volsync-rsync-tls-dst-busybox-pvc-5-ddpwg 1/1 Running 1 7m42s
volsync-rsync-tls-dst-busybox-pvc-6-f62xf 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-7-fn8mn 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-7-sp9gr 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-8-mkl9f 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-9-rdk98 1/1 Running 1 7m43s
amagrawa:c2$ oc get pods -n busybox-workloads-12
NAME READY STATUS RESTARTS AGE
volsync-rsync-tls-dst-busybox-pvc-1-fpx8k 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-10-p2kvq 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-11-pb7bb 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-12-q5hgm 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-13-wj5bk 1/1 Running 0 8m
volsync-rsync-tls-dst-busybox-pvc-14-jln2s 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-15-l4jsr 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-16-zzlld 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-17-zrlg7 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-18-k98dw 1/1 Running 1 7m35s
volsync-rsync-tls-dst-busybox-pvc-19-flq4k 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-2-rbkvb 1/1 Running 1 7m46s
volsync-rsync-tls-dst-busybox-pvc-20-hnsr7 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-3-q8hgc 1/1 Running 0 7m50s
volsync-rsync-tls-dst-busybox-pvc-4-fqzsp 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-5-8qtc8 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-6-qmplt 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-7-6vnsm 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-8-8hpzn 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-8-qnzpn 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-9-hl8jn 1/1 Running 0 10m
amagrawa:c2$ oc get pods -n busybox-workloads-13
NAME READY STATUS RESTARTS AGE
volsync-rsync-tls-dst-busybox-pvc-1-bsxj9 1/1 Running 0 7m43s
volsync-rsync-tls-dst-busybox-pvc-10-9k5sl 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-10-fl5qn 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-11-nl4h8 1/1 Running 1 7m46s
volsync-rsync-tls-dst-busybox-pvc-12-mdkbv 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-12-td9bh 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-13-tnrrw 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-14-jgdkw 1/1 Running 1 7m56s
volsync-rsync-tls-dst-busybox-pvc-15-vt4bv 1/1 Running 0 8m13s
volsync-rsync-tls-dst-busybox-pvc-16-thvlj 1/1 Running 2 15m
volsync-rsync-tls-dst-busybox-pvc-17-njmdw 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-17-nqfz5 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-18-b8xvx 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-19-jmwwv 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-2-dblf2 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-20-wbd77 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-3-tfs4j 1/1 Running 0 10m
volsync-rsync-tls-dst-busybox-pvc-4-pr4dc 1/1 Running 2 15m
volsync-rsync-tls-dst-busybox-pvc-5-65mb9 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-5-gpxll 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-6-h9dg9 1/1 Running 2 11m
volsync-rsync-tls-dst-busybox-pvc-7-nmwhr 1/1 Running 0 11m
volsync-rsync-tls-dst-busybox-pvc-8-nw77f 0/1 ContainerStatusUnknown 1 2d17h
volsync-rsync-tls-dst-busybox-pvc-8-qqrl9 0/1 ContainerCreating 0 2d17h
volsync-rsync-tls-dst-busybox-pvc-9-j9dnk 1/1 Running 0 6m56s
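(To pull out only the stuck dst pods across the four stalled namespaces, something along these lines should work — a sketch using the namespaces shown above, not a command taken from the report:)
$ for ns in busybox-workloads-10 busybox-workloads-11 busybox-workloads-12 busybox-workloads-13; do
    echo "== $ns =="; oc get pods -n "$ns" --no-headers | grep -v Running
  done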
We also see the MDS crashing, along with MDS blocklist entries, on cluster C1.
From C1:
amagrawa:~$ crash
ID ENTITY NEW
2023-11-16T09:27:56.005749Z_0e6bf806-c030-4015-a480-8472717d78df mds.ocs-storagecluster-cephfilesystem-a
2023-11-16T09:31:46.585416Z_1ebe91e6-a32f-461c-af98-39e5632683bf mds.ocs-storagecluster-cephfilesystem-a
2023-11-18T06:48:44.968132Z_d3235ee3-ef54-4f77-97a8-62c57e0e3266 mds.ocs-storagecluster-cephfilesystem-a *
2023-11-18T15:44:29.157802Z_f4529bb6-3011-4015-9f17-67950e5fd469 mds.ocs-storagecluster-cephfilesystem-a *
2023-11-18T17:21:09.328424Z_b44b669f-a3b3-47b3-8852-0d0f50f801b0 mds.ocs-storagecluster-cephfilesystem-a *
2023-11-19T01:55:40.796669Z_69708b4a-b65e-4069-9d67-f620c81e39bf mds.ocs-storagecluster-cephfilesystem-a *
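(The crash listing above is presumably ceph crash ls run from the rook-ceph-tools/toolbox pod — the bash-5.1$ prompt in the next snippet suggests the toolbox — e.g. something like:)
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls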
bash-5.1$ ceph crash info 2023-11-19T01:55:40.796669Z_69708b4a-b65e-4069-9d67-f620c81e39bf
{
"backtrace": [
"/lib64/libc.so.6(+0x54df0) [0x7f83e3828df0]",
"(std::_Rb_tree_decrement(std::_Rb_tree_node_base*)+0x23) [0x7f83e3b9b0d3]",
"ceph-mds(+0x1078e5) [0x56481f8778e5]",
"ceph-mds(+0x4ccea2) [0x56481fc3cea2]",
"(MDSTableClient::got_journaled_ack(unsigned long)+0x15c) [0x56481faed9fc]",
"(MDLog::_replay_thread()+0x753) [0x56481fb38923]",
"ceph-mds(+0x13ff41) [0x56481f8aff41]",
"/lib64/libc.so.6(+0x9f802) [0x7f83e3873802]",
"/lib64/libc.so.6(+0x3f450) [0x7f83e3813450]"
],
"ceph_version": "17.2.6-148.el9cp",
"crash_id": "2023-11-19T01:55:40.796669Z_69708b4a-b65e-4069-9d67-f620c81e39bf",
"entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "9.2 (Plow)",
"os_version_id": "9.2",
"process_name": "ceph-mds",
"stack_sig": "fbb67541618c973da2228cd3e35dfa753d03a6704f9f669fd7d06921c34435bb",
"timestamp": "2023-11-19T01:55:40.796669Z",
"utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-9d76bb6dbddft",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
}
bash-5.1$ ceph crash info 2023-11-18T17:21:09.328424Z_b44b669f-a3b3-47b3-8852-0d0f50f801b0
{
"backtrace": [
"/lib64/libc.so.6(+0x54df0) [0x7f7f01876df0]",
"(std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)+0x134) [0x7f7f01be9644]",
"(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x3db8) [0x562ab72fc688]",
"(EUpdate::replay(MDSRank*)+0x5d) [0x562ab730650d]",
"(MDLog::_replay_thread()+0x753) [0x562ab72b1923]",
"ceph-mds(+0x13ff41) [0x562ab7028f41]",
"/lib64/libc.so.6(+0x9f802) [0x7f7f018c1802]",
"/lib64/libc.so.6(+0x3f450) [0x7f7f01861450]"
],
"ceph_version": "17.2.6-148.el9cp",
"crash_id": "2023-11-18T17:21:09.328424Z_b44b669f-a3b3-47b3-8852-0d0f50f801b0",
"entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "9.2 (Plow)",
"os_version_id": "9.2",
"process_name": "ceph-mds",
"stack_sig": "cab1dfc243115315f194002ddce635b37fa0765a0ee7d109f957b984029f3ade",
"timestamp": "2023-11-18T17:21:09.328424Z",
"utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-9d76bb6dbddft",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
}
amagrawa:~$ blocklist
10.128.4.62:6801/2448646630 2023-11-22T03:32:45.603326+0000
10.128.4.62:6800/2448646630 2023-11-22T03:32:45.603326+0000
10.131.2.29:6801/947461227 2023-11-21T21:07:06.564298+0000
10.131.2.29:6800/947461227 2023-11-21T21:07:06.564298+0000
10.128.4.62:6801/4214049925 2023-11-21T16:54:07.267916+0000
10.128.4.62:6800/4214049925 2023-11-21T16:54:07.267916+0000
listed 6 entries
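(The blocklist output above matches the format of ceph osd blocklist ls; the blocklist alias presumably wraps that command in the same toolbox pod:)
$ ceph osd blocklist ls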
This issue is being discussed with CephFS engineering here: https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1700497464642019
and also with the CSI team, in case it is relevant to them, to better understand the actual root cause: https://chat.google.com/room/AAAAqWkMm2s/5SOd8QUuafM
Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
Volsync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2
odf-multicluster-orchestrator.v4.14.1-rhodf
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Refer to the description of the problem above.
2.
3.
Actual results: Some of the dst pods remain stuck for several cephfs workloads, due to which data sync cannot progress.
Expected results: dst pods should not remain stuck, and data sync should be able to progress for all cephfs workloads.
Additional info: