Data Foundation Bugs / DFBUGS-691

[2218759] [RDR][Tracker][cephFS] Mds crash reported while running IOs


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: odf-4.13
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets): The issue was seen on one of the managed clusters (C2) of an RDR setup where CephFS-based workloads are running.

      Version of all relevant components (if applicable):
      ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)
      ODF 4.13.0-219.snaptrim
      ACM 2.8 Ga'ed, Submariner v0.15.0-rc1 (globalnet enabled)
      OCP 4.13.0-0.nightly-2023-06-20-224158

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an RDR setup, run CephFS-based workloads for a few days. Keep monitoring crash reports and Ceph health (a minimal monitoring sketch follows these steps).
      2.
      3.
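      A minimal monitoring sketch, assuming the rook-ceph toolbox deployment is enabled in the openshift-storage namespace (the shell aliases used in the outputs below are not defined in this report):

      # Open a shell in the toolbox pod
      oc -n openshift-storage rsh deploy/rook-ceph-tools
      # Inside the toolbox, watch health and new crash reports while IOs run
      ceph status
      ceph health detail
      ceph crash ls-new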

      Actual results: An MDS crash leaves the Ceph cluster in a warning state (HEALTH_WARN).

      The ODF and OCP must-gather logs are kept here:
      http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/29jun23/

      ===========================================================================
      The ODF must-gather logs have crash events collected for the MDS on compute-2.
      ===========================================================================
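      For reference, a sketch of how the must-gather logs above are typically collected (the ODF image tag below is an assumption for 4.13 and should be adjusted to the build under test):

      # OCP must-gather with the default image
      oc adm must-gather
      # ODF must-gather (image/tag assumed)
      oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13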

      amagrawa:~$ ceph
      cluster:
      id: e01d27f1-af46-464f-ba8e-20d7a6f613a2
      health: HEALTH_WARN
      2 daemons have recently crashed

      services:
      mon: 3 daemons, quorum d,e,f (age 38h)
      mgr: a(active, since 38h)
      mds: 1/1 daemons up, 1 hot standby
      osd: 3 osds: 3 up (since 38h), 3 in (since 3d)
      rbd-mirror: 1 daemon active (1 hosts)
      rgw: 1 daemon active (1 hosts, 1 zones)

      data:
      volumes: 1/1 healthy
      pools: 12 pools, 169 pgs
      objects: 823.68k objects, 375 GiB
      usage: 1.1 TiB used, 405 GiB / 1.5 TiB avail
      pgs: 169 active+clean

      io:
      client: 43 MiB/s rd, 15 MiB/s wr, 5.26k op/s rd, 34 op/s wr

      amagrawa:~$ crash
      ID ENTITY NEW
      2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b mds.ocs-storagecluster-cephfilesystem-a *
      2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d mds.ocs-storagecluster-cephfilesystem-a *
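      Each crash can be inspected and, once triaged, archived so the cluster returns to HEALTH_OK; a short sketch using the standard ceph crash commands (run from the toolbox):

      # Full metadata and backtrace for a specific crash (see the output further below)
      ceph crash info 2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b
      # Acknowledge crashes after triage to clear the "recently crashed" warning
      ceph crash archive <crash-id>
      ceph crash archive-all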

      amagrawa:~$ blocklist
      10.129.2.30:0/1012624902 2023-06-29T15:56:41.028232+0000
      10.129.2.30:6801/2352278496 2023-06-29T15:56:41.028232+0000
      10.129.2.30:0/1404853102 2023-06-29T15:56:41.028232+0000
      10.131.0.38:6801/1961824421 2023-06-29T15:56:35.502621+0000
      10.129.2.30:0/1825081866 2023-06-29T15:56:41.028232+0000
      10.131.0.38:6800/1961824421 2023-06-29T15:56:35.502621+0000
      10.129.2.30:0/2079859022 2023-06-29T15:56:41.028232+0000
      10.129.2.30:6800/2352278496 2023-06-29T15:56:41.028232+0000
      10.129.2.30:0/1739121923 2023-06-29T15:56:41.028232+0000
      listed 9 entries
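      "blocklist" above is presumably a local alias; the underlying command to list blocklisted client addresses on quincy is:

      ceph osd blocklist ls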

      amagrawa:~$ pods|grep 10.129.2.30
      rook-ceph-mgr-a-57d775c46d-ztqc7 2/2 Running 3 (21h ago) 2d23h 10.129.2.30 compute-1 <none> <none>

      amagrawa:~$ pods|grep 10.131.0.38
      rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-cb69847b25mtw 2/2 Running 3 (21h ago) 2d23h 10.131.0.38 compute-0 <none> <none>
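      "pods" is likewise presumably an alias; the blocklisted addresses can be mapped back to their pods with a wide pod listing, e.g.:

      oc -n openshift-storage get pods -o wide | grep 10.129.2.30
      oc -n openshift-storage get pods -o wide | grep 10.131.0.38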

      amagrawa:~$ ceph
      cluster:
      id: e01d27f1-af46-464f-ba8e-20d7a6f613a2
      health: HEALTH_WARN
      2 daemons have recently crashed

      services:
      mon: 3 daemons, quorum d,e,f (age 21h)
      mgr: a(active, since 21h)
      mds: 1/1 daemons up, 1 hot standby
      osd: 3 osds: 3 up (since 21h), 3 in (since 2d)
      rbd-mirror: 1 daemon active (1 hosts)
      rgw: 1 daemon active (1 hosts, 1 zones)

      data:
      volumes: 1/1 healthy
      pools: 12 pools, 169 pgs
      objects: 772.61k objects, 318 GiB
      usage: 959 GiB used, 577 GiB / 1.5 TiB avail
      pgs: 169 active+clean

      io:
      client: 85 MiB/s rd, 54 MiB/s wr, 4.21k op/s rd, 46 op/s wr

      bash-5.1$ ceph crash info 2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b
      {
      "backtrace": [
      "/lib64/libc.so.6(+0x54df0) [0x7f8ca8af9df0]",
      "ceph-mds(+0x2a18bb) [0x55b08b0568bb]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x2a18ca) [0x55b08b0568ca]",
      "ceph-mds(+0x4facc2) [0x55b08b2afcc2]",
      "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x89) [0x55b08b1c29c9]",
      "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x55b08b1c7e33]",
      "(EUpdate::replay(MDSRank*)+0x40) [0x55b08b1d52b0]",
      "(MDLog::_replay_thread()+0x753) [0x55b08b180623]",
      "ceph-mds(+0x1416d1) [0x55b08aef66d1]",
      "/lib64/libc.so.6(+0x9f802) [0x7f8ca8b44802]",
      "/lib64/libc.so.6(+0x3f450) [0x7f8ca8ae4450]"
      ],
      "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
      "crash_id": "2023-06-27T12:52:39.845901Z_67d348d2-a869-414e-b979-d9d2e61ced8b",
      "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
      "os_id": "rhel",
      "os_name": "Red Hat Enterprise Linux",
      "os_version": "9.2 (Plow)",
      "os_version_id": "9.2",
      "process_name": "ceph-mds",
      "stack_sig": "c92ab69b737674ea8db895f8fc033a66ea94156e5feda28d629549d09827fc47",
      "timestamp": "2023-06-27T12:52:39.845901Z",
      "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm",
      "utsname_machine": "x86_64",
      "utsname_release": "5.14.0-284.18.1.el9_2.x86_64",
      "utsname_sysname": "Linux",
      "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023"
      }
      bash-5.1$ ceph crash info 2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d
      {
      "backtrace": [
      "/lib64/libc.so.6(+0x54df0) [0x7fbdba989df0]",
      "ceph-mds(+0x4facbf) [0x5632e26c3cbf]",
      "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x5632e25d6991]",
      "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x7e3) [0x5632e25dbe33]",
      "(EOpen::replay(MDSRank*)+0x4f) [0x5632e25e702f]",
      "(MDLog::_replay_thread()+0x753) [0x5632e2594623]",
      "ceph-mds(+0x1416d1) [0x5632e230a6d1]",
      "/lib64/libc.so.6(+0x9f802) [0x7fbdba9d4802]",
      "/lib64/libc.so.6(+0x3f450) [0x7fbdba974450]"
      ],
      "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
      "crash_id": "2023-06-27T13:20:51.830037Z_eead41ef-87fa-4c53-b4c4-110f8512ef4d",
      "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
      "os_id": "rhel",
      "os_name": "Red Hat Enterprise Linux",
      "os_version": "9.2 (Plow)",
      "os_version_id": "9.2",
      "process_name": "ceph-mds",
      "stack_sig": "7ef4f918b0d4512e2dbdb56443deb2952c421234f56685461d5354f3776f1013",
      "timestamp": "2023-06-27T13:20:51.830037Z",
      "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm",
      "utsname_machine": "x86_64",
      "utsname_release": "5.14.0-284.18.1.el9_2.x86_64",
      "utsname_sysname": "Linux",
      "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023"
      }
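      Both backtraces point at journal replay (EMetaBlob::replay called from MDLog::_replay_thread). A sketch for pulling the logs of the crashed MDS container instance, assuming the Rook MDS pod's main container is named "mds":

      # Logs from the previous (crashed) container instance of mds.a
      oc -n openshift-storage logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58c865876p6cm -c mds --previous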

      Expected results: MDS crashes shouldn't be reported while running IOs.

      Additional info:
      I am sharing the live cluster details to help debug further if needed.

      http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/amagrawa-c2/amagrawa-c2_20230626T104958/openshift-cluster-dir/auth/kubeconfig

      Web Console: https://console-openshift-console.apps.amagrawa-c2.qe.rh-ocs.com
      Login: kubeadmin
      Password: DZADJ-jJnFL-Wfmwm-cMmha

              mchangir@redhat.com Milind Changire
              amagrawa@redhat.com Aman Agrawal
              Elad Ben Aharon