Data Foundation Bugs / DFBUGS-572

[2282346] [RDR] Multiple MDS crashes seen on the surviving cluster post hub recovery


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: odf-4.16
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):

      ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
      OCP 4.16.0-0.nightly-2024-04-26-145258
      ODF 4.16.0-89.stable
      ACM 2.10.2
      MCE 2.5.2
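
      For reference, the versions above can be collected from a live cluster along these lines (a sketch; the ACM and MCE namespaces assume default installs):

      oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'   # OCP
      oc get csv -n openshift-storage                                               # ODF operator
      oc get csv -n open-cluster-management                                         # ACM
      oc get csv -n multicluster-engine                                             # MCE
      oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph version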

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      ****Active hub co-situated with primary managed cluster****

      1. On an RDR setup, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
      2. Fail over all the workloads running on the down managed cluster to the surviving managed cluster.
      3. After successful failover, recover the down managed cluster.
      4. Now fail over one of the CephFS workloads where PeerReady is marked as true but the replication destination isn't created due to the eviction period, which is currently 24 hours. Successful failover should be possible once BZ2283038 is fixed.

      During all these operations, keep an eye on the health of the MDS pods and watch for any crashes (a minimal monitoring sketch follows these steps).
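
      A minimal monitoring sketch, assuming the default openshift-storage namespace and the standard Rook labels (app=rook-ceph-mds, app=rook-ceph-tools):

      # Poll MDS pod status and new Ceph crash entries once a minute
      TOOLS="$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)"
      while true; do
        oc get pods -n openshift-storage -l app=rook-ceph-mds -o wide
        oc -n openshift-storage rsh "$TOOLS" ceph crash ls-new
        sleep 60
      done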

      Actual results: An MDS crash is seen on the surviving cluster C2, to which the workloads were failed over.

      oc get pods -n openshift-storage -o wide | grep mds
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c 2/2 Running 1023 (9m33s ago) 7d13h 10.128.2.63 compute-2 <none> <none>
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp 2/2 Running 1006 (17m ago) 7d13h 10.131.0.241 compute-1 <none> <none>
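
      The restart counts above (1023 and 1006 over roughly 7.5 days, i.e. a restart every ten minutes or so on average) show both daemons crash-looping. A sketch to track just those counters (container index 0 is assumed to be the mds container):

      oc get pods -n openshift-storage -l app=rook-ceph-mds \
        -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount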

      oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls
      ID ENTITY NEW
      2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669 mds.ocs-storagecluster-cephfilesystem-a
      2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12 mds.ocs-storagecluster-cephfilesystem-b
      2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3 mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07 mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-17T21:56:39.158029Z_12c1efa9-ecfc-4c32-9024-77423ae09ecf mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-18T00:36:35.787255Z_40eb624a-a7ed-4415-b17f-4085bd9eac9b mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-18T03:32:14.163891Z_84c9b511-ac45-4bf2-9040-8014883e80a9 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-19T12:56:35.497622Z_b3c014e8-cba6-496d-a673-e2b2f026be21 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109 mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-20T08:06:13.398353Z_5ebabaee-807c-4910-ba27-14eeee4b4fba mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-20T17:27:47.267271Z_2c553b1c-d6db-43e8-9421-2994c1745e40 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-20T18:18:11.530034Z_bf609bf3-25c1-4342-ad7b-f942579daeeb mds.ocs-storagecluster-cephfilesystem-a *
      2024-05-21T22:21:43.087757Z_06bdcccb-e7f5-4f95-a310-a053aed2fc0a mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-22T01:07:49.214137Z_bd01b408-376c-4c80-822e-ba3ae76e371f mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-22T01:13:58.832386Z_9da0f105-acb7-4026-a497-0f0da1b77f08 mds.ocs-storagecluster-cephfilesystem-b *
      2024-05-22T03:54:30.740192Z_0dc95829-aa4a-4e34-aaeb-39cd3cf4d7c7 mds.ocs-storagecluster-cephfilesystem-a *
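
      The trailing * in the NEW column marks crashes that haven't been archived yet; once triaged, they can be acknowledged with the standard crash subcommands, e.g.:

      ceph crash archive <crash-id>   # or: ceph crash archive-all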

      bash-5.1$ ceph crash info 2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669
      {
          "archived": "2024-05-14 19:09:39.585770",
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f1a1abf3db0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x56062dc65c43]",
              "(MDLog::_replay_thread()+0x75e) [0x56062dcb852e]",
              "ceph-mds(+0x16cf21) [0x56062da01f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f1a1ac3e802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f1a1abde450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-14T17:33:05.811016Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfd6f5m4",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }

      bash-5.1$ ceph crash info 2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f60d8013db0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55daad309c43]",
              "(MDLog::_replay_thread()+0x75e) [0x55daad35c52e]",
              "ceph-mds(+0x16cf21) [0x55daad0a5f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f60d805e802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f60d7ffe450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-15T11:11:58.726060Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }

      bash-5.1$ ceph crash info 2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f7cbf29bdb0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55bd68c5dc43]",
              "(MDLog::_replay_thread()+0x75e) [0x55bd68cb052e]",
              "ceph-mds(+0x16cf21) [0x55bd689f9f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f7cbf2e6802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f7cbf286450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-15T12:26:15.553277Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }

      bash-5.1$ ceph crash info 2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54db0) [0x7f8e450e2db0]",
              "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x5616ea8d8c43]",
              "(MDLog::_replay_thread()+0x75e) [0x5616ea92b52e]",
              "ceph-mds(+0x16cf21) [0x5616ea674f21]",
              "/lib64/libc.so.6(+0x9f802) [0x7f8e4512d802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f8e450cd450]"
          ],
          "ceph_version": "18.2.1-136.el9cp",
          "crash_id": "2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.3 (Plow)",
          "os_version_id": "9.3",
          "process_name": "ceph-mds",
          "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
          "timestamp": "2024-05-16T07:07:03.233208Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
      }
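
      Every crash above reports the same stack_sig (36c718b7...), i.e. both daemons keep dying at the same point in MDSTableClient::got_journaled_ack() while MDLog::_replay_thread() replays the journal. A quick sketch to confirm the deduplication across all recorded crashes (assumes jq is available in the toolbox pod):

      for id in $(ceph crash ls | awk 'NR>1 {print $1}'); do
        ceph crash info "$id" | jq -r '.stack_sig'
      done | sort | uniq -c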

      Must-gather logs from the cluster are kept here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/
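
      The must-gather above was taken with elevated MDS log levels; if another reproduction is needed, they can be raised from the toolbox along these lines (20/1 are common debugging values, not prescribed ones):

      ceph config set mds debug_mds 20
      ceph config set mds debug_ms 1
      # reproduce the crash, collect logs, then revert:
      ceph config rm mds debug_mds
      ceph config rm mds debug_ms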

      Expected results: The MDS shouldn't crash unexpectedly when the cluster isn't heavily loaded and a failover is performed.

      Additional info:

              Assignee: Venky Shankar (vshankar@redhat.com)
              Reporter: Aman Agrawal (amagrawa@redhat.com)
              QA Contact: Elad Ben Aharon