Data Foundation Bugs / DFBUGS-691

[2218759] [RDR][Tracker][cephFS] Mds crash reported while running IOs


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: odf-4.13
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets): The issue was seen on one of the managed clusters (C2) of an RDR setup where CephFS-based workloads are running.

      Version of all relevant components (if applicable):
      ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)
      ODF 4.13.0-219.snaptrim
      ACM 2.8 Ga'ed, Submariner v0.15.0-rc1 (globalnet enabled)
      OCP 4.13.0-0.nightly-2023-06-20-224158

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an RDR setup, run CephFS-based workloads for a few days. Keep monitoring crash reports and Ceph health (a minimal monitoring sketch follows these steps).
      2.
      3.
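      A minimal monitoring sketch, assuming the rook-ceph toolbox deployment is enabled in the openshift-storage namespace (the shell aliases used in the outputs below are not defined in this report):

      # Open a shell in the toolbox pod
      oc -n openshift-storage rsh deploy/rook-ceph-tools
      # Inside the toolbox, watch health and new crash reports while IOs run
      ceph status
      ceph health detail
      ceph crash ls-new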

      Actual results: An MDS crash leaves the Ceph cluster in a warning state (HEALTH_WARN).

      The ODF and OCP must-gather logs are kept here:
      http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/29jun23/

      ===========================================================================
      The ODF must-gather logs have crash events collected for the MDS on compute-2.
      ===========================================================================
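      For reference, a sketch of how the must-gather logs above are typically collected (the ODF image tag below is an assumption for 4.13 and should be adjusted to the build under test):

      # OCP must-gather with the default image
      oc adm must-gather
      # ODF must-gather (image/tag assumed)
      oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13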

      amagrawa:~$ ceph
      cluster:
      id: e01d27f1-af46-464f-ba8e-20d7a6f613a2
      health: HEALTH_WARN
      2 daemons have recently crashed

      services:
      mon: 3 daemons, quorum d,e,f (age 38h)
      mgr: a(active, since 38h)
      mds: 1/1 daemons up, 1 hot standby
      osd: 3 osds: 3 up (since 38h), 3 in (since 3d)
      rbd-mirror: 1 daemon active (1 hosts)
      rgw: 1 daemon active (1 hosts, 1 zones)

      data:
      volumes: 1/1 healthy
      pools: 12 pools, 169 pgs
      objects: 823.68k objects, 375 GiB
      usage: 1.1 TiB used, 405 GiB / 1.5 TiB avail
      pgs: 169 active+clean

      io:
      client: 43 MiB/s rd, 15 MiB/s wr, 5.26k op/s rd, 34 op/s wr

      amagrawa:~$ crash
      ID ENTITY NEW
      2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b mds.ocs-storagecluster-cephfilesystem-a *
      2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d mds.ocs-storagecluster-cephfilesystem-a *
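      Each crash can be inspected and, once triaged, archived so the cluster returns to HEALTH_OK; a short sketch using the standard ceph crash commands (run from the toolbox):

      # Full metadata and backtrace for a specific crash (see the output further below)
      ceph crash info 2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b
      # Acknowledge crashes after triage to clear the "recently crashed" warning
      ceph crash archive <crash-id>
      ceph crash archive-all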

      amagrawa:~$ blocklist
      10.129.2.30:0/1012624902 2023-06-29T15:56:41.028232+0000
      10.129.2.30:6801/2352278496 2023-06-29T15:56:41.028232+0000
      10.129.2.30:0/1404853102 2023-06-29T15:56:41.028232+0000
      10.131.0.38:6801/1961824421 2023-06-29T15:56:35.502621+0000
      10.129.2.30:0/1825081866 2023-06-29T15:56:41.028232+0000
      10.131.0.38:6800/1961824421 2023-06-29T15:56:35.502621+0000
      10.129.2.30:0/2079859022 2023-06-29T15:56:41.028232+0000
      10.129.2.30:6800/2352278496 2023-06-29T15:56:41.028232+0000
      10.129.2.30:0/1739121923 2023-06-29T15:56:41.028232+0000
      listed 9 entries
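      "blocklist" above is presumably a local alias; the underlying command to list blocklisted client addresses on quincy is:

      ceph osd blocklist ls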

      amagrawa:~$ pods|grep 10.129.2.30
      rook-ceph-mgr-a-57d775c46d-ztqc7 2/2 Running 3 (21h ago) 2d23h 10.129.2.30 compute-1 <none> <none>

      amagrawa:~$ pods|grep 10.131.0.38
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cb69847b25mtw 2/2 Running 3 (21h ago) 2d23h 10.131.0.38 compute-0 <none> <none>
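      "pods" is likewise presumably an alias; the blocklisted addresses can be mapped back to their pods with a wide pod listing, e.g.:

      oc -n openshift-storage get pods -o wide | grep 10.129.2.30
      oc -n openshift-storage get pods -o wide | grep 10.131.0.38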

      amagrawa:~$ ceph
      cluster:
      id: e01d27f1-af46-464f-ba8e-20d7a6f613a2
      health: HEALTH_WARN
      2 daemons have recently crashed

      services:
      mon: 3 daemons, quorum d,e,f (age 21h)
      mgr: a(active, since 21h)
      mds: 1/1 daemons up, 1 hot standby
      osd: 3 osds: 3 up (since 21h), 3 in (since 2d)
      rbd-mirror: 1 daemon active (1 hosts)
      rgw: 1 daemon active (1 hosts, 1 zones)

      data:
      volumes: 1/1 healthy
      pools: 12 pools, 169 pgs
      objects: 772.61k objects, 318 GiB
      usage: 959 GiB used, 577 GiB / 1.5 TiB avail
      pgs: 169 active+clean

      io:
      client: 85 MiB/s rd, 54 MiB/s wr, 4.21k op/s rd, 46 op/s wr

      bash-5.1$ ceph crash info 2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b
      {
      "backtrace": [
      "/lib64/libc.so.6(+0x54df0) [0x7f8ca8af9df0]",
      "ceph-mds(+0x2a18bb) [0x55b08b0568bb]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x4facc2) [0x55b08b2afcc2]",
      "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x89) [0x55b08b1c29c9]",
      "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x55b08b1c7e33]",
      "(EUpdate::replay(MDSRank*)+0x40) [0x55b08b1d52b0]",
      "(MDLog::_replay_thread()+0x753) [0x55b08b180623]",
      "ceph-mds(+0x1416d1) [0x55b08aef66d1]",
      "/lib64/libc.so.6(+0x9f802) [0x7f8ca8b44802]",
      "/lib64/libc.so.6(+0x3f450) [0x7f8ca8ae4450]"
      ],
      "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
      "crash_id": "2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b",
      "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
      "os_id": "rhel",
      "os_name": "Red Hat Enterprise Linux",
      "os_version": "9.2 (Plow)",
      "os_version_id": "9.2",
      "process_name": "ceph-mds",
      "stack_sig": "c92ab69b737674ea8db895f8fc033a66ea94156e5feda28d629549d09827fc47",
      "timestamp": "2023-06-27T12:52:39.845901Z",
      "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm",
      "utsname_machine": "x86_64",
      "utsname_release": "5.14.0-284.18.1.el9_2.x86_64",
      "utsname_sysname": "Linux",
      "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023"
      }
      bash-5.1$ ceph crash info 2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d
      {
      "backtrace": [
      "/lib64/libc.so.6(+0x54df0) [0x7fbdba989df0]",
      "ceph-mds(+0x4facbf) [0x5632e26c3cbf]",
      "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x5632e25d6991]",
      "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x5632e25dbe33]",
      "(EOpen::replay(MDSRank*)+0x4f) [0x5632e25e702f]",
      "(MDLog::_replay_thread()+0x753) [0x5632e2594623]",
      "ceph-mds(+0x1416d1) [0x5632e230a6d1]",
      "/lib64/libc.so.6(+0x9f802) [0x7fbdba9d4802]",
      "/lib64/libc.so.6(+0x3f450) [0x7fbdba974450]"
      ],
      "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
      "crash_id": "2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d",
      "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
      "os_id": "rhel",
      "os_name": "Red Hat Enterprise Linux",
      "os_version": "9.2 (Plow)",
      "os_version_id": "9.2",
      "process_name": "ceph-mds",
      "stack_sig": "7ef4f918b0d4512e2dbdb56443deb2952c421234f56685461d5354f3776f1013",
      "timestamp": "2023-06-27T13:20:51.830037Z",
      "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm",
      "utsname_machine": "x86_64",
      "utsname_release": "5.14.0-284.18.1.el9_2.x86_64",
      "utsname_sysname": "Linux",
      "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023"
      }
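      Both backtraces point at journal replay (EMetaBlob::replay called from MDLog::_replay_thread). A sketch for pulling the logs of the crashed MDS container instance, assuming the Rook MDS pod's main container is named "mds":

      # Logs from the previous (crashed) container instance of mds.a
      oc -n openshift-storage logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm -c mds --previous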

      Expected results: MDS crashes shouldn't be reported while running IOs.

      Additional info:
      I am sharing the live cluster details to help debug further if needed.

      http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/amagrawa-c2/amagrawa-c2_20230626T104958/openshift-cluster-dir/auth/kubeconfig

      Web Console: https://console-openshift-console.apps.amagrawa-c2.qe.rh-ocs.com
      Login: kubeadmin
      Password: DZADJ-jJnFL-Wfmwm-cMmha

              mchangir@redhat.com Milind Changire
              amagrawa@redhat.com Aman Agrawal
              Elad Ben Aharon