- Bug
- Resolution: Unresolved
- Critical
- None
- odf-4.17
- None
Description of problem (please be as detailed as possible and provide log
snippets):
The ceph-mds process has generated the following crash on a 4.17.0-114 cluster, and the CephCluster is showing a HEALTH_WARN state.
Since the crash happened in libc.so.6, the backtrace does not provide detailed function names or locations beyond the generic memory addresses.
sh-5.1$ ceph crash info 2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x3e6f0) [0x7f66b5bea6f0]",
        "[0x5579e7405330]"
    ],
    "ceph_version": "18.2.1-229.el9cp",
    "crash_id": "2024-10-05T06:10:52.621730Z_c7f22b45-d236-43ea-86a4-aa19b31c380a",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.4 (Plow)",
    "os_version_id": "9.4",
    "process_name": "ceph-mds",
    "stack_sig": "12c4f060cf8b59a0ebac25da63a7f5b2a2cf5b99f12a288248409824102b5615",
    "timestamp": "2024-10-05T06:10:52.621730Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.37.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Sep 13 12:41:50 EDT 2024"
}
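For reference, one way to get a bit more context out of the generic addresses in the backtrace above is to resolve the libc offset against matching debug symbols and, if a coredump of ceph-mds was captured, open it with gdb. This is only a sketch, assuming the matching glibc debuginfo is installed and a coredump exists; the core file path below is hypothetical.
# resolve the libc offset from the first backtrace frame (requires the matching glibc debuginfo)
addr2line -f -e /lib64/libc.so.6 0x3e6f0
# load a captured coredump with the same ceph-mds build to get a full backtrace
# (the core file path is hypothetical)
gdb /usr/bin/ceph-mds /path/to/core.ceph-mds -ex 'thread apply all bt' -ex 'quit'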
rook-ceph-mds logs
==========
❯ ocs logs rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d77bb68ccw6s
Defaulted container "mds" out of: mds, log-collector, chown-container-data-dir (init)
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 set uid:gid to 167:167 (ceph:ceph)
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable), process ceph-mds, pid 151
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 1 main not setting numa affinity
debug 2024-10-05T06:10:52.959+0000 7f528f474ac0 0 pidfile_write: ignore empty --pid-file
starting mds.ocs-storagecluster-cephfilesystem-b at
debug 2024-10-05T06:10:52.970+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 35 from mon.0
debug 2024-10-05T06:10:52.991+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 36 from mon.0
debug 2024-10-05T06:10:52.991+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Monitors have assigned me to become a standby.
debug 2024-10-05T06:11:42.402+0000 7f528ac08640 1 mds.ocs-storagecluster-cephfilesystem-b Updating MDS map to version 41 from mon.0
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 handle_mds_map i am now mds.74241.0 replaying mds.0.0
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 replay_start
debug 2024-10-05T06:11:42.403+0000 7f528ac08640 1 mds.0.0 waiting for osdmap 127 (which blocklists prior instance)
debug 2024-10-05T06:11:42.451+0000 7f5284bfc640 0 mds.0.cache creating system inode with ino:0x100
debug 2024-10-05T06:11:42.451+0000 7f5284bfc640 0 mds.0.cache creating system inode with ino:0x1
debug 2024-10-06T00:07:31.076+0000 7f528c40b640 -1 received signal: Hangup from (PID: 39092) UID: 0
debug 2024-10-06T00:07:31.080+0000 7f528c40b640 -1 received signal: Hangup from (PID: 39093) UID: 0
debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 Fail to open '/proc/91114/cmdline' error = (2) No such file or directory
debug 2024-10-07T00:07:31.508+0000 7f528c40b640 -1 received signal: Hangup from <unknown> (PID: 91114) UID: 0
debug 2024-10-07T00:07:31.511+0000 7f528c40b640 -1 received signal: Hangup from (PID: 91115) UID: 0
CephCluster is in HEALTH_WARN state
==========
❯ ocs get cephclusters.ceph.rook.io
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH        EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          3d21h   Ready   Cluster created successfully   HEALTH_WARN              6b3f9622-7cbd-44b0-9991-4c75c6f9cf39
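The specific warning behind HEALTH_WARN (most likely a RECENT_CRASH warning raised by the mds crash above) can be confirmed from the rook-ceph tools pod. A minimal sketch, assuming the usual openshift-storage namespace and the rook-ceph-tools deployment name:
❯ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail
❯ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls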
Version of all relevant components (if applicable): 4.17
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? Y
Is there any workaround available to the best of your knowledge? N
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible? N
Can this issue be reproduced from the UI? N
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy an ODF 4.17.0-114 cluster
2. Create 4 PVCs with the CephFS interface
3. Attach a PVC to each pod and start a FIO workload from each pod (see the sketch after this list)
4. Wait 3-4 minutes
5. Power off one worker node from vCenter and wait 120 seconds
6. Power on the same worker node and wait until the node joins the cluster
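For reference, a minimal sketch of steps 2-3 (one PVC plus one FIO pod; repeat for the remaining PVCs). The storage class name assumes the default ODF CephFS class ocs-storagecluster-cephfs; the PVC/pod names and the fio image are illustrative.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-pvc-1
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: fio-pod-1
spec:
  containers:
  - name: fio
    # illustrative image; any image with fio installed works
    image: quay.io/example/fio:latest
    command: ["fio", "--name=randwrite", "--rw=randwrite", "--bs=4k", "--size=1G",
              "--runtime=600", "--time_based", "--directory=/mnt/cephfs"]
    volumeMounts:
    - name: data
      mountPath: /mnt/cephfs
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fio-pvc-1
EOF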
Actual results: After power-on, the worker node has rejoined the cluster, but the CephCluster is showing a HEALTH_WARN state and the ceph-mds process has generated a crash
Expected results: When the node rejoins the cluster, all operations are expected to work
Additional info:
Must Gather logs : https://ibm.box.com/s/vxanlqhr461m82gafl3984a3awtsrlso