- Bug
- Resolution: Unresolved
- Critical
- odf-4.12
- None
Description of problem (please be as detailed as possible and provide log snippets):
All three OSDs are in CrashLoopBackOff (CLBO), and coredumps are being generated on each node. Journalctl logs show an exit code of 139.
Customer is on the IBM Z/s390x platform.
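For reference, a container exit status of 139 means the main process was killed by SIGSEGV (139 - 128 = signal 11), which is consistent with the segmentation faults in the kernel log below. A quick way to confirm the signal mapping and enumerate the dumps on a node (a sketch, assuming systemd-coredump is in use as the journal entries indicate):
—
# 139 - 128 = 11, i.e. SIGSEGV
kill -l 11
# list the captured coredumps for the crashing process
coredumpctl list ceph-bluestore-
—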
Journalctl logs from storage3 node:
—
052fe21dd149c734 description=openshift-storage/rook-ceph-osd-1-74b6bcb5d9-8dtjf/activate id=fc79bebb-d698-4502-82f5-fa43441fb571 name=/runtime.v1.RuntimeService/StartContainer sandboxID=2c3b4d8c5663ebce3e5cef76659dbd2af5174a2ca5b793c4d3d895328b703723
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User process fault: interruption code 0011 ilc:3 in ld-2.28.so[3ff99f00000+29000]
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Failing address: 000003ffcd37f000 TEID: 000003ffcd37f400
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Fault in primary space mode while using user ASCE.
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: AS:00000009916041c7 R3:0000000972f24007 S:0000000a8d87d000 P:0000000000000400
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: CPU: 4 PID: 676 Comm: ceph-bluestore- Not tainted 4.18.0-372.96.1.el8_6.s390x #1
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Hardware name: IBM 3931 A01 508 (KVM/Linux)
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User PSW : 0705200180000000 000003ff99f129a0
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User GPRS: 0000000000000000 000003ff99f68800 0000000000000002 0000000000000002
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: 000003ff99f69010 000003ff99bd3d20 000000000000000c 000003ff99f69010
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: 000003ff99f690d0 000000000000000c 000003ff99f6a280 000003ff99bd3d20
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: 000003ff99bd2e40 000003ff99bc16c8 000003ff99f12bcc 000003ffcd37ff48
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User Code: 000003ff99f1299a: 0707 bcr 0,%r7
000003ff99f1299c: 0707 bcr 0,%r7
#000003ff99f1299e: 0707 bcr 0,%r7
>000003ff99f129a0: eb6ff0300024 stmg %r6,%r15,48(%r15)
000003ff99f129a6: b90400ef lgr %r14,%r15
000003ff99f129aa: e3f0ff30ff71 lay %r15,-208(%r15)
000003ff99f129b0: a7ebfff0 aghi %r14,-16
000003ff99f129b4: 6080e000 std %f8,0(%r14)
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Last Breaking-Event-Address:
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: [<000003ff99f12bc6>] 0x3ff99f12bc6
Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: Started Process Core Dump (PID 700/UID 0).
Jul 16 20:20:38 storage3.ocpmfdc0p01.enterprise.wistate.us kubenswrapper[2587]: I0716 20:20:38.261860 2587 kubelet.go:2157] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-osd-1-74b6bcb5d9-8dtjf" event=&
Jul 16 20:20:38 storage3.ocpmfdc0p01.enterprise.wistate.us systemd-coredump[705]: Removed old coredump core.ceph-mon.0.166d167cfeda4c6bb0eb4da9ae845864.2814806.1721092159000000.lz4.
Jul 16 20:20:38 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: run-runc-7e44fd45ce8b1995ae78f248705aadc728c51bb655b1e16a0c7fcd4c14026923-runc.2Oc8v3.mount: Succeeded.
Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us systemd-coredump[705]: Process 676 (ceph-bluestore-) of user 0 dumped core.
Stack trace of thread 245:
#0 0x000003ff99f129a0 n/a (/usr/lib64/ld-2.28.so)
Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: systemd-coredump@12376-700-0.service: Succeeded.
Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: systemd-coredump@12376-700-0.service: Consumed 1.013s CPU time
Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us conmon[657]: conmon d700d73bea45364addbd <ninfo>: container 676 exited with status 139
—
I attempted to analyze the storage3 coredump in an s390x lab, though I am unsure whether the analysis was successful:
—
[root@s390x-kvm-056 coredumps]# uname -a
Linux s390x-kvm-056.lab.eng.rdu2.redhat.com 4.18.0-372.96.1.el8_6.s390x #1 SMP Mon Mar 4 22:41:25 EST 2024 s390x s390x s390x GNU/Linux
[root@s390x-kvm-056 coredumps]# podman run -it -v ./:/mnt:Z --entrypoint=/bin/bash registry.redhat.io/rhceph/rhceph-5-rhel8:5-499
[root@9f577cbc5cf4 /]# uname -a
Linux 9f577cbc5cf4 4.18.0-372.96.1.el8_6.s390x #1 SMP Mon Mar 4 22:41:25 EST 2024 s390x s390x s390x GNU/Linux
[root@9f577cbc5cf4 03874568]# file storage3-ceph-bluestore-core-dump
storage3-ceph-bluestore-core-dump: ELF 64-bit MSB core file, IBM S/390, version 1 (SYSV), SVR4-style, from 'ceph-bluestore-tool prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/ceph-bluestore-tool', platform: 'z900'
[root@9f577cbc5cf4 03874568]# gdb $(which ceph-bluestore-tool) storage3-ceph-bluestore-core-dump
Reading symbols from /usr/bin/ceph-bluestore-tool...Reading symbols from /usr/lib/debug/usr/bin/ceph-bluestore-tool-16.2.10-248.el8cp.s390x.debug...done.
done.
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: core file may not match specified executable file.
[New LWP 275]
Core was generated by `ceph-bluestore-tool prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000003ffa39bca00 in ?? ()
(gdb) bt full
Python Exception <class 'gdb.error'> PC not saved:
#0 0x000003ffa39bca00 in ?? ()
No symbol table info available.
(gdb) list
212
213 void inferring_bluefs_devices(vector<string>& devs, std::string& path)
214 {
215 cout << "inferring bluefs devices from bluestore path" << std::endl;
216         for (auto fn : {"block", "block.wal", "block.db"}) {
217 string p = path + "/" + fn;
218 struct stat st;
219 if (::stat(p.c_str(), &st) == 0)
—
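The "core file may not match specified executable file" warning suggests the binary and library builds in the lab container differ from those that produced the dump, which would explain why gdb cannot recover a stack. One way to check, as a sketch (assuming elfutils is available; the core filename is from the session above):
—
# list the build IDs of every module recorded in the core
eu-unstrip -n --core=storage3-ceph-bluestore-core-dump
# compare with the build ID of the binary under test
eu-readelf -n /usr/bin/ceph-bluestore-tool | grep -i 'build id'
—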
[root@9f577cbc5cf4 03874568]# exit
exit
[root@s390x-kvm-056 coredumps]# yum list installed | grep ceph
ceph-base.s390x 2:16.2.10-248.el8cp @@commandline
ceph-base-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-common.s390x 2:16.2.10-248.el8cp @@commandline
ceph-common-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-debugsource.s390x 2:16.2.10-248.el8cp @@commandline
ceph-fuse.s390x 2:16.2.10-248.el8cp @@commandline
ceph-fuse-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-immutable-object-cache.s390x 2:16.2.10-248.el8cp @@commandline
ceph-immutable-object-cache-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-mds.s390x 2:16.2.10-248.el8cp @@commandline
ceph-mds-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-mgr-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-mgr-modules-core.noarch 2:18.2.0-1.fc39 @@commandline
ceph-mon.s390x 2:16.2.10-248.el8cp @@commandline
ceph-mon-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-osd.s390x 2:16.2.10-248.el8cp @@commandline
ceph-osd-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-radosgw.s390x 2:16.2.10-248.el8cp @@commandline
ceph-radosgw-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
ceph-resource-agents.s390x 2:16.2.10-248.el8cp @@commandline
ceph-selinux.s390x 2:16.2.10-248.el8cp @@commandline
ceph-test.s390x 2:16.2.10-248.el8cp @@commandline
ceph-test-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
cephfs-mirror.s390x 2:16.2.10-248.el8cp @@commandline
cephfs-mirror-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
libcephfs-devel.s390x 2:16.2.10-248.el8cp @@commandline
libcephfs2.s390x 2:16.2.10-248.el8cp @@commandline
libcephfs2-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
libcephsqlite.s390x 2:16.2.10-248.el8cp @@commandline
libcephsqlite-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
libcephsqlite-devel.s390x 2:16.2.10-248.el8cp @@commandline
python3-ceph-argparse.s390x 2:16.2.10-248.el8cp @@commandline
python3-ceph-common.s390x 2:16.2.10-248.el8cp @@commandline
python3-cephfs.s390x 2:16.2.10-248.el8cp @@commandline
python3-cephfs-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
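Before repeating the gdb attempt on the host, it may also be worth confirming that the container image ships the same ceph build as the debuginfo installed above (a quick check, assuming the ceph-base RPM is present in the image; the tag is the one used earlier):
—
podman run --rm --entrypoint=rpm registry.redhat.io/rhceph/rhceph-5-rhel8:5-499 -q ceph-base
—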
[root@s390x-kvm-056 03874568]# gdb /usr/bin/ceph-bluestore-tool storage3-ceph-bluestore-core-dump
Reading symbols from /usr/bin/ceph-bluestore-tool...Reading symbols from /usr/lib/debug/usr/bin/ceph-bluestore-tool-16.2.10-248.el8cp.s390x.debug...done.
done.
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: Can't open file (null) during file-backed mapping note processing
warning: core file may not match specified executable file.
[New LWP 275]
Core was generated by `ceph-bluestore-tool prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000003ffa39bca00 in ?? ()
(gdb) bt
Python Exception <class 'gdb.error'> PC not saved:
#0 0x000003ffa39bca00 in ?? ()
—
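Since neither attempt could walk the stack ("PC not saved", no file-backed mappings), a possible next step, sketched here with illustrative paths, is to debug against the exact container rootfs rather than host packages, so gdb can resolve the mappings recorded in the core:
—
# export the container image's rootfs (/tmp/rhceph-rootfs is an illustrative path)
mkdir -p /tmp/rhceph-rootfs
ctr=$(podman create registry.redhat.io/rhceph/rhceph-5-rhel8:5-499)
podman export "$ctr" | tar -C /tmp/rhceph-rootfs -x
podman rm "$ctr"
# point gdb at that rootfs so ld-2.28.so and the other mapped files resolve
gdb -ex 'set sysroot /tmp/rhceph-rootfs' \
    /tmp/rhceph-rootfs/usr/bin/ceph-bluestore-tool storage3-ceph-bluestore-core-dump
—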
Version of all relevant components (if applicable):
OCP Version 4.12.53
ODF Version 4.12.14
Node OS: Linux storage3.ocpmfdc0p01.enterprise.wistate.us 4.18.0-372.96.1.el8_6.s390x #1 SMP Mon Mar 4 22:41:25 EST 2024 s390x s390x s390x GNU/Linux
Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)? Yes, all OSDs are in CLBO; no workloads are able to run.
Is there any workaround available to the best of your knowledge? No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 4
Is this issue reproducible? Not that I am aware of.
Can this issue be reproduced from the UI? Not that I am aware of.
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Relevant attachments are in supportshell:
- sosreports for storage1 and storage3
- coredumps from all nodes
- ODF and OCP must-gather
I currently have an s390x architecture lab running; please let me know if there is any assistance I can provide.