Type: Bug
Resolution: Unresolved
Priority: Critical
Target version: odf-4.13
Description of problem (please be as detailed as possible and provide log
snippets): The issue was seen on one of the managed clusters (Cluster C2) of an RDR setup where CephFS-based workloads are running.
Version of all relevant components (if applicable):
ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)
ODF 4.13.0-219.snaptrim
ACM 2.8 GA, Submariner v0.15.0-rc1 (globalnet enabled)
OCP 4.13.0-0.nightly-2023-06-20-224158
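For reference, the component versions above can be cross-checked with commands such as the following (a sketch; the openshift-storage namespace and the rook-ceph-tools toolbox deployment are the ODF defaults and may differ on a given cluster):
oc get clusterversion                                              # OCP build
oc get csv -n openshift-storage                                    # ODF operator versions
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions   # Ceph daemon versions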
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. On an RDR setup, run CephFS-based workloads for a few days. Keep monitoring crash reports and Ceph health (see the monitoring sketch after these steps).
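A minimal monitoring sketch for step 1, assuming the Ceph commands are run from the rook-ceph toolbox pod (deployment name assumed to be rook-ceph-tools in openshift-storage):
ceph health detail   # overall health, including "daemons have recently crashed"
ceph crash ls-new    # crashes that have not been archived yet
ceph fs status       # MDS state of the CephFS filesystem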
Actual results: An MDS crash puts Ceph into a warning (HEALTH_WARN) state.
The ODF and OCP must-gather logs are kept here:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/29jun23/
===========================================================================
The ODF must-gather logs have crash events collected for the MDS on compute-2.
===========================================================================
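For reference, must-gather archives like the ones above are typically collected along these lines (a sketch; <odf-must-gather-image> is a placeholder, not the exact image used for this cluster):
oc adm must-gather                                   # default OCP must-gather
oc adm must-gather --image=<odf-must-gather-image>   # ODF-specific must-gather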
amagrawa:~$ ceph
cluster:
id: e01d27f1-af46-464f-ba8e-20d7a6f613a2
health: HEALTH_WARN
2 daemons have recently crashed
services:
mon: 3 daemons, quorum d,e,f (age 38h)
mgr: a(active, since 38h)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 38h), 3 in (since 3d)
rbd-mirror: 1 daemon active (1 hosts)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 823.68k objects, 375 GiB
usage: 1.1 TiB used, 405 GiB / 1.5 TiB avail
pgs: 169 active+clean
io:
client: 43 MiB/s rd, 15 MiB/s wr, 5.26k op/s rd, 34 op/s wr
amagrawa:~$ crash
ID ENTITY NEW
2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b mds.ocs-storagecluster-cephfilesystem-a *
2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d mds.ocs-storagecluster-cephfilesystem-a *
amagrawa:~$ blocklist
10.129.2.30:0/1012624902 2023-06-29T15:56:41.028232+0000
10.129.2.30:6801/2352278496 2023-06-29T15:56:41.028232+0000
10.129.2.30:0/1404853102 2023-06-29T15:56:41.028232+0000
10.131.0.38:6801/1961824421 2023-06-29T15:56:35.502621+0000
10.129.2.30:0/1825081866 2023-06-29T15:56:41.028232+0000
10.131.0.38:6800/1961824421 2023-06-29T15:56:35.502621+0000
10.129.2.30:0/2079859022 2023-06-29T15:56:41.028232+0000
10.129.2.30:6800/2352278496 2023-06-29T15:56:41.028232+0000
10.129.2.30:0/1739121923 2023-06-29T15:56:41.028232+0000
listed 9 entries
amagrawa:~$ pods|grep 10.129.2.30
rook-ceph-mgr-a-57d775c46d-ztqc7 2/2 Running 3 (21h ago) 2d23h 10.129.2.30 compute-1 <none> <none>
amagrawa:~$ pods|grep 10.131.0.38
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cb69847b25mtw 2/2 Running 3 (21h ago) 2d23h 10.131.0.38 compute-0 <none> <none>
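The ceph, crash, blocklist, and pods prompts above appear to be local shell aliases; a minimal sketch of the equivalent commands, assuming the Ceph toolbox is deployed as rook-ceph-tools in openshift-storage:
oc rsh -n openshift-storage deploy/rook-ceph-tools   # open a shell in the toolbox
ceph -s                                              # cluster status ("ceph" above)
ceph crash ls                                        # recent daemon crashes ("crash" above)
ceph osd blocklist ls                                # blocklisted client addresses ("blocklist" above)
# from the host, to map a blocklisted IP back to a pod ("pods" above):
oc get pods -n openshift-storage -o wide | grep 10.129.2.30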
amagrawa:~$ ceph
cluster:
id: e01d27f1-af46-464f-ba8e-20d7a6f613a2
health: HEALTH_WARN
2 daemons have recently crashed
services:
mon: 3 daemons, quorum d,e,f (age 21h)
mgr: a(active, since 21h)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 21h), 3 in (since 2d)
rbd-mirror: 1 daemon active (1 hosts)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 772.61k objects, 318 GiB
usage: 959 GiB used, 577 GiB / 1.5 TiB avail
pgs: 169 active+clean
io:
client: 85 MiB/s rd, 54 MiB/s wr, 4.21k op/s rd, 46 op/s wr
bash-5.1$ ceph crash info 2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b
{
"backtrace": [
"/lib64/libc.so.6(+0x54df0) [0x7f8ca8af9df0]",
"ceph-mds(+0x2a18bb) [0x55b08b0568bb]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
"ceph-mds(+0x4facc2) [0x55b08b2afcc2]",
"(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x89) [0x55b08b1c29c9]",
"(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x55b08b1c7e33]",
"(EUpdate::replay(MDSRank*)+0x40) [0x55b08b1d52b0]",
"(MDLog::_replay_thread()+0x753) [0x55b08b180623]",
"ceph-mds(+0x1416d1) [0x55b08aef66d1]",
"/lib64/libc.so.6(+0x9f802) [0x7f8ca8b44802]",
"/lib64/libc.so.6(+0x3f450) [0x7f8ca8ae4450]"
],
"ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
"crash_id": "2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b",
"entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "9.2 (Plow)",
"os_version_id": "9.2",
"process_name": "ceph-mds",
"stack_sig": "c92ab69b737674ea8db895f8fc033a66ea94156e5feda28d629549d09827fc47",
"timestamp": "2023-06-27T12:52:39.845901Z",
"utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-284.18.1.el9_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023"
}
bash-5.1$ ceph crash info 2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d
{
"backtrace": [
"/lib64/libc.so.6(+0x54df0) [0x7fbdba989df0]",
"ceph-mds(+0x4facbf) [0x5632e26c3cbf]",
"(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x5632e25d6991]",
"(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x5632e25dbe33]",
"(EOpen::replay(MDSRank*)+0x4f) [0x5632e25e702f]",
"(MDLog::_replay_thread()+0x753) [0x5632e2594623]",
"ceph-mds(+0x1416d1) [0x5632e230a6d1]",
"/lib64/libc.so.6(+0x9f802) [0x7fbdba9d4802]",
"/lib64/libc.so.6(+0x3f450) [0x7fbdba974450]"
],
"ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
"crash_id": "2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d",
"entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "9.2 (Plow)",
"os_version_id": "9.2",
"process_name": "ceph-mds",
"stack_sig": "7ef4f918b0d4512e2dbdb56443deb2952c421234f56685461d5354f3776f1013",
"timestamp": "2023-06-27T13:20:51.830037Z",
"utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-284.18.1.el9_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023"
}
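Once the crash metadata has been collected, the "daemons have recently crashed" HEALTH_WARN can be cleared with the standard Ceph crash-archive commands; note that this only silences the warning and does not address the underlying MDS crash:
ceph crash info <crash-id>      # inspect a specific crash, as shown above
ceph crash archive <crash-id>   # archive one crash
ceph crash archive-all          # or archive all new crashes at once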
Expected results: MDS crashes should not be reported while running I/Os.
Additional info:
I am sharing the live cluster details to help debug further if needed.
Web Console: https://console-openshift-console.apps.amagrawa-c2.qe.rh-ocs.com
Login: kubeadmin
Password: DZADJ-jJnFL-Wfmwm-cMmha