DFBUGS-619: [2317236] ceph-mds process reported crash on ODF 4.17 cluster

    • Project: Data Foundation Bugs
    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: odf-4.17
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):
      The ceph-mds process generated the following crash on a 4.17.0-114 cluster, and the cephcluster is in HEALTH_WARN state.
      The crash occurred in libc.so.6, and the backtrace does not provide function names or locations beyond generic memory addresses.

      sh-5.1$ ceph crash info 2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x3e6f0) [0x7f66b5bea6f0]",
              "[0x5579e7405330]"
          ],
          "ceph_version": "18.2.1-229.el9cp",
          "crash_id": "2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a",
          "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.4 (Plow)",
          "os_version_id": "9.4",
          "process_name": "ceph-mds",
          "stack_sig": "12c4f060cf8b59a0ebac25da63a7f5b2a2cf5b99f12a288248409824102b5615",
          "timestamp": "2024-10-05T06:10:52.621730Z",
          "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.37.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Sep 13 12:41:50 EDT 2024"
      }
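
      For reference, a minimal sketch of the related ceph crash commands (assuming they are run from the rook-ceph toolbox pod or another pod with an admin keyring for this cluster):

      sh-5.1$ ceph crash ls                   # list recent crashes and their IDs
      sh-5.1$ ceph crash info <crash-id>      # full metadata for one crash (output shown above)
      sh-5.1$ ceph crash archive <crash-id>   # acknowledge a crash so it no longer counts as a recent crash
      sh-5.1$ ceph crash archive-all          # acknowledge all recent crashes at once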

      rook-ceph-mds logs
      ==========
      ❯ ocs logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s
      Defaulted container "mds" out of: mds, log-collector, chown-container-data-dir (init)
      debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 set uid:gid to 167:167 (ceph:ceph)
      debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable), process ceph-mds, pid 151
      debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 1 main not setting numa affinity
      debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 pidfile_write: ignore empty --pid-file
      starting mds.ocs-storagecluster-cephfilesystem-b at
      debug 2024-10-05T06:10:52.970+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 35 from mon.0
      debug 2024-10-05T06:10:52.991+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 36 from mon.0
      debug 2024-10-05T06:10:52.991+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Monitors have assigned me to become a standby.
      debug 2024-10-05T06:11:42.402+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 41 from mon.0
      debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 handle_mds_map i am now mds.74241.0 replaying mds.0.0
      debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
      debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 replay_start
      debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 waiting for osdmap 127 (which blocklists prior instance)
      debug 2024-10-05T06:11:42.451+0000 7f5284bfc640 0 mds.0.cache creating system inode with ino:0x100
      debug 2024-10-05T06:11:42.451+0000 7f5284bfc640 0 mds.0.cache creating system inode with ino:0x1
      debug 2024-10-06T00:07:31.076+0000 7f528c40b640 -1 received signal: Hangup from (PID: 39092) UID: 0
      debug 2024-10-06T00:07:31.080+0000 7f528c40b640 -1 received signal: Hangup from (PID: 39093) UID: 0
      debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 Fail to open '/proc/91114/cmdline' error = (2) No such file or directory
      debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 received signal: Hangup from <unknown> (PID: 91114) UID: 0
      debug 2024-10-07T00:07:31.511+0000 7f528c40b640 -1 received signal: Hangup from (PID: 91115) UID: 0

      Cephcluster is in HEALTH_WARN state
      ==========
      ❯ ocs get cephclusters.ceph.rook.io
      NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
      ocs-storagecluster-cephcluster /var/lib/rook 3 3d21h Ready Cluster created successfully HEALTH_WARN 6b3f9622-7cbd-44b0-9991-4c75c6f9cf39
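
      The HEALTH_WARN here is most likely the RECENT_CRASH health check raised for the crash above; a minimal sketch of how to confirm from the toolbox (the openshift-storage namespace and the rook-ceph-tools deployment name are assumptions about this environment):

      ❯ oc -n openshift-storage rsh deploy/rook-ceph-tools    # open a shell in the toolbox pod, if deployed
      sh-5.1$ ceph status                                     # overall cluster state
      sh-5.1$ ceph health detail                              # expected to show "1 daemons have recently crashed"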

      Version of all relevant components (if applicable): 4.17

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)? Y

      Is there any workaround available to the best of your knowledge? N

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible? N

      Can this issue be reproduced from the UI? N

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Deploy an ODF 4.17.0-114 cluster
      2. Create 4 PVCs with the CephFS interface (see the example manifests after this list)
      3. Attach the PVCs to pods and start a FIO workload from each pod
      4. Wait for 3-4 minutes
      5. Power off one worker node from vCenter and wait 120 seconds
      6. Power on the same worker node and wait until the node joins the cluster
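
      For reference, a minimal sketch of the kind of manifests used in steps 2-3 (all object names, the namespace, the fio image, and the ocs-storagecluster-cephfs storage class name are assumptions for illustration; the actual manifests used in the test are not attached):

      # cephfs-pvc.yaml -- one of the four CephFS-backed PVCs
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: cephfs-pvc-1
        namespace: fio-test
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 10Gi
        storageClassName: ocs-storagecluster-cephfs

      # fio-pod.yaml -- pod that mounts the PVC and runs a continuous FIO write workload
      apiVersion: v1
      kind: Pod
      metadata:
        name: fio-pod-1
        namespace: fio-test
      spec:
        containers:
          - name: fio
            image: quay.io/example/fio:latest    # placeholder image; any image with fio installed
            command: ["fio", "--name=writetest", "--directory=/mnt/cephfs",
                      "--rw=randwrite", "--bs=4k", "--size=1G",
                      "--time_based", "--runtime=600"]
            volumeMounts:
              - name: data
                mountPath: /mnt/cephfs
        volumes:
          - name: data
            persistentVolumeClaim:
              claimName: cephfs-pvc-1

      ❯ oc apply -f cephfs-pvc.yaml -f fio-pod.yaml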

      Actual results: After power-on, the worker node rejoined the cluster, but the cephcluster is in HEALTH_WARN state and the ceph-mds process has generated a crash

      Expected results: When the node rejoins the cluster, all operations are expected to work

      Additional info:

      Must Gather logs : https://ibm.box.com/s/vxanlqhr461m82gafl3984a3awtsrlso
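
      For reference, ODF must-gather data like the archive above is typically collected with oc adm must-gather pointed at the ODF must-gather image (the exact image path and tag below are assumptions; check the ODF 4.17 documentation for the current value):

      ❯ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.17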

              Milind Changire (mchangir@redhat.com) (Inactive)
              Parag Kamble (pakamble)
              Elad Ben Aharon