Data Foundation Bugs / DFBUGS-572

[2282346] [RDR] Multiple MDS crashes seen on the surviving cluster post hub recovery


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: odf-4.16
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):

      ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
      OCP 4.16.0-0.nightly-2024-04-26-145258
      ODF 4.16.0-89.stable
      ACM 2.10.2
      MCE 2.5.2
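
      For reference, the versions above can be collected from a live cluster along these lines (a sketch; the ACM and MCE namespaces assume default installs):

      oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'   # OCP
      oc get csv -n openshift-storage                                               # ODF operator
      oc get csv -n open-cluster-management                                         # ACM
      oc get csv -n multicluster-engine                                             # MCE
      oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph version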

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      ****Active hub co-situated with primary managed cluster****

      1. On an RDR setup, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
      2. Fail over all the workloads running on the down managed cluster to the surviving managed cluster.
      3. After successful failover, recover the down managed cluster.
      4. Now fail over one of the CephFS workloads where PeerReady is marked as true but the replication destination isn't created due to the eviction period, which is currently 24 hours. Successful failover should be possible once BZ2283038 is fixed.

      During all these operations, keep an eye on the health of the MDS pods and watch for any crashes (a minimal monitoring sketch follows these steps).
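
      A minimal monitoring sketch, assuming the default openshift-storage namespace and the standard Rook labels (app=rook-ceph-mds, app=rook-ceph-tools):

      # Poll MDS pod status and new Ceph crash entries once a minute
      TOOLS="$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)"
      while true; do
        oc get pods -n openshift-storage -l app=rook-ceph-mds -o wide
        oc -n openshift-storage rsh "$TOOLS" ceph crash ls-new
        sleep 60
      done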

      Actual results: An MDS crash is seen on the surviving cluster C2, to which the workloads were failed over.

      oc get pods -n openshift-storage -o wide | grep mds
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c 2/2 Running 1023 (9m33s ago) 7d13h 10.128.2.63 compute-2 <none> <none>
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp 2/2 Running 1006 (17m ago) 7d13h 10.131.0.241 compute-1 <none> <none>
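
      The restart counts above (1023 and 1006 over roughly 7.5 days, i.e. a restart every ten minutes or so on average) show both daemons crash-looping. A sketch to track just those counters (container index 0 is assumed to be the mds container):

      oc get pods -n openshift-storage -l app=rook-ceph-mds \
        -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount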

      oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls
      ID ENTITY NEW
      2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669 mds.ocs-storagecluster-cephfilesystem-a
      2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12 mds.ocs-storagecluster-cephfilesystem-b
      2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3 mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07 mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-17T21:56:39.158029Z_12c1efa9-ecfc-4c32-9024-77423ae09ecf mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-18T00:36:35.787255Z_40eb624a-a7ed-4415-b17f-4085bd9eac9b mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-18T03:32:14.163891Z_84c9b511-ac45-4bf2-9040-8014883e80a9 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-19T12:56:35.497622Z_b3c014e8-cba6-496d-a673-e2b2f026be21 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109 mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-20T08:06:13.398353Z_5ebabaee-807c-4910-ba27-14eeee4b4fba mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-20T17:27:47.267271Z_2c553b1c-d6db-43e8-9421-2994c1745e40 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-20T18:18:11.530034Z_bf609bf3-25c1-4342-ad7b-f942579daeeb mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-21T22:21:43.087757Z_06bdcccb-e7f5-4f95-a310-a053aed2fc0a mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-22T01:07:49.214137Z_bd01b408-376c-4c80-822e-ba3ae76e371f mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-22T01:13:58.832386Z_9da0f105-acb7-4026-a497-0f0da1b77f08 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-22T03:54:30.740192Z_0dc95829-aa4a-4e34-aaeb-39cd3cf4d7c7 mds.ocs-storagecluster-cephfilesystem-a *
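
      The trailing * in the NEW column marks crashes that haven't been archived yet; once triaged, they can be acknowledged with the standard crash subcommands, e.g.:

      ceph crash archive <crash-id>   # or: ceph crash archive-all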

      bash-5.1$ ceph crash info 2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669
      {
          "archived": "2024-05-14 19:09:39.585770",
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f1a1abf3db0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x56062dc65c43]",
              "(MDLog::_replay_thread()+0x75e) [0x56062dcb852e]",
              "ceph-mds(+0x16cf21) [0x56062da01f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f1a1ac3e802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f1a1abde450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-14T17:33:05.811016Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfd6f5m4",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }

      bash-5.1$ ceph crash info 2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f60d8013db0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55daad309c43]",
              "(MDLog::_replay_thread()+0x75e) [0x55daad35c52e]",
              "ceph-mds(+0x16cf21) [0x55daad0a5f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f60d805e802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f60d7ffe450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-15T11:11:58.726060Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }

      bash-5.1$ ceph crash info 2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f7cbf29bdb0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55bd68c5dc43]",
              "(MDLog::_replay_thread()+0x75e) [0x55bd68cb052e]",
              "ceph-mds(+0x16cf21) [0x55bd689f9f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f7cbf2e6802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f7cbf286450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-15T12:26:15.553277Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }

      bash-5.1$ ceph crash info 2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f8e450e2db0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x5616ea8d8c43]",
              "(MDLog::_replay_thread()+0x75e) [0x5616ea92b52e]",
              "ceph-mds(+0x16cf21) [0x5616ea674f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f8e4512d802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f8e450cd450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-16T07:07:03.233208Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }
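
      Every crash above reports the same stack_sig (36c718b7...), i.e. both daemons keep dying at the same point in MDSTableClient::got_journaled_ack() while MDLog::_replay_thread() replays the journal. A quick sketch to confirm the deduplication across all recorded crashes (assumes jq is available in the toolbox pod):

      for id in $(ceph crash ls | awk 'NR>1 {print $1}'); do
        ceph crash info "$id" | jq -r '.stack_sig'
      done | sort | uniq -c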

      Must-gather logs from the cluster are kept here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/
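
      The must-gather above was taken with elevated MDS log levels; if another reproduction is needed, they can be raised from the toolbox along these lines (20/1 are common debugging values, not prescribed ones):

      ceph config set mds debug_mds 20
      ceph config set mds debug_ms 1
      # reproduce the crash, collect logs, then revert:
      ceph config rm mds debug_mds
      ceph config rm mds debug_ms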

      Expected results: The MDS shouldn't crash unexpectedly when the cluster isn't heavily loaded and a failover is performed.

      Additional info:

              Assignee: Venky Shankar (vshankar@redhat.com)
              Reporter: Aman Agrawal (amagrawa@redhat.com)
              QA Contact: Elad Ben Aharon