Data Foundation Bugs / DFBUGS-292

[2299779] [IBM Z] All OSDs in CLBO (dynamic linker and libtcmalloc recursive call loop)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.18
    • odf-4.12
    • ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):

      All three OSDs are in CLBO (CrashLoopBackOff), and coredumps are being generated on each node. Journalctl logs show an exit code of 139.
      The customer is on the IBM Z (s390x) platform.
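
      Exit code 139 corresponds to 128 + 11, i.e. the container's main process was killed by SIGSEGV, which matches the coredumps below. As a minimal sketch (assuming the openshift-storage namespace and the usual rook OSD pod label app=rook-ceph-osd; adjust for the actual cluster), the pod state and the captured coredumps can be confirmed with:

      # Confirm the CrashLoopBackOff state of the OSD pods
      oc get pods -n openshift-storage -l app=rook-ceph-osd

      # On an affected node, list the captured coredumps and the kernel fault messages
      coredumpctl list
      journalctl -k | grep -i "User process fault"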

      Journalctl logs from the storage3 node:


      052fe21dd149c734 description=openshift-storage/rook-ceph-osd-1-74b6bcb5d9-8dtjf/activate id=fc79bebb-d698-4502-82f5-fa43441fb571 name=/runtime.v1.RuntimeService/StartContainer sandboxID=2c3b4d8c5663ebce3e5cef76659dbd2af5174a2ca5b793c4d3d895328b703723
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User process fault: interruption code 0011 ilc:3 in ld-2.28.so[3ff99f00000+29000]
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Failing address: 000003ffcd37f000 TEID: 000003ffcd37f400
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Fault in primary space mode while using user ASCE.
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: AS:00000009916041c7 R3:0000000972f24007 S:0000000a8d87d000 P:0000000000000400
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: CPU: 4 PID: 676 Comm: ceph-bluestore- Not tainted 4.18.0-372.96.1.el8_6.s390x #1
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Hardware name: IBM 3931 A01 508 (KVM/Linux)
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User PSW : 0705200180000000 000003ff99f129a0
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User GPRS: 0000000000000000 000003ff99f68800 0000000000000002 0000000000000002
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: 000003ff99f69010 000003ff99bd3d20 000000000000000c 000003ff99f69010
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: 000003ff99f690d0 000000000000000c 000003ff99f6a280 000003ff99bd3d20
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: 000003ff99bd2e40 000003ff99bc16c8 000003ff99f12bcc 000003ffcd37ff48
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: User Code: 000003ff99f1299a: 0707 bcr 0,%r7
      000003ff99f1299c: 0707 bcr 0,%r7
      #000003ff99f1299e: 0707 bcr 0,%r7
      >000003ff99f129a0: eb6ff0300024 stmg %r6,%r15,48(%r15)
      000003ff99f129a6: b90400ef lgr %r14,%r15
      000003ff99f129aa: e3f0ff30ff71 lay %r15,-208(%r15)
      000003ff99f129b0: a7ebfff0 aghi %r14,-16
      000003ff99f129b4: 6080e000 std %f8,0(%r14)
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: Last Breaking-Event-Address:
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us kernel: [<000003ff99f12bc6>] 0x3ff99f12bc6
      Jul 16 20:20:37 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: Started Process Core Dump (PID 700/UID 0).
      Jul 16 20:20:38 storage3.ocpmfdc0p01.enterprise.wistate.us kubenswrapper[2587]: I0716 20:20:38.261860 2587 kubelet.go:2157] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-osd-1-74b6bcb5d9-8dtjf" event=&{ID:43edeb82-c246-4939-a904-e0e34dda95f0 Type:ContainerStarted Data:d700d73bea45364addbd025189c1d373c53a0db4a4d0e0a9052fe21dd149c734}

      Jul 16 20:20:38 storage3.ocpmfdc0p01.enterprise.wistate.us systemd-coredump[705]: Removed old coredump core.ceph-mon.0.166d167cfeda4c6bb0eb4da9ae845864.2814806.1721092159000000.lz4.
      Jul 16 20:20:38 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: run-runc-7e44fd45ce8b1995ae78f248705aadc728c51bb655b1e16a0c7fcd4c14026923-runc.2Oc8v3.mount: Succeeded.
      Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us systemd-coredump[705]: Process 676 (ceph-bluestore-) of user 0 dumped core.

      Stack trace of thread 245:
      #0 0x000003ff99f129a0 n/a (/usr/lib64/ld-2.28.so)
      Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: systemd-coredump@12376-700-0.service: Succeeded.
      Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us systemd[1]: systemd-coredump@12376-700-0.service: Consumed 1.013s CPU time
      Jul 16 20:20:39 storage3.ocpmfdc0p01.enterprise.wistate.us conmon[657]: conmon d700d73bea45364addbd <ninfo>: container 676 exited with status 139
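
      The kernel fault line reports the failing instruction inside ld-2.28.so, mapped at 0x3ff99f00000, with the PSW at 0x3ff99f129a0, i.e. offset 0x129a0 into the dynamic linker. The failing address falls in the stack page targeted by the stmg prologue store at the PSW (r15 = 0x3ffcd37ff48), which would be consistent with stack exhaustion from the recursive loop referenced in the summary. A minimal sketch of resolving that offset (assuming elfutils is available and the matching glibc build plus its debuginfo, e.g. inside the analysis container used below):

      # Offset of the faulting instruction inside ld-2.28.so (from the kernel fault line)
      printf '%#x\n' $((0x3ff99f129a0 - 0x3ff99f00000))   # -> 0x129a0

      # Resolve the offset to a symbol/source line (requires the matching glibc debuginfo)
      eu-addr2line -f -e /usr/lib64/ld-2.28.so 0x129a0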

      I attempted to analyze the storage3 coredumps in an s390x lab, but I am unsure whether the analysis was successful:


      [root@s390x-kvm-056 coredumps]# uname -a
      Linux s390x-kvm-056.lab.eng.rdu2.redhat.com 4.18.0-372.96.1.el8_6.s390x #1 SMP Mon Mar 4 22:41:25 EST 2024 s390x s390x s390x GNU/Linux
      [root@s390x-kvm-056 coredumps]# podman run -it -v ./:/mnt:Z --entrypoint=/bin/bash registry.redhat.io/rhceph/rhceph-5-rhel8:5-499
      [root@9f577cbc5cf4 /]# uname -a
      Linux 9f577cbc5cf4 4.18.0-372.96.1.el8_6.s390x #1 SMP Mon Mar 4 22:41:25 EST 2024 s390x s390x s390x GNU/Linux
      [root@9f577cbc5cf4 03874568]# file storage3-ceph-bluestore-core-dump
      storage3-ceph-bluestore-core-dump: ELF 64-bit MSB core file, IBM S/390, version 1 (SYSV), SVR4-style, from 'ceph-bluestore-tool prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/ceph-bluestore-tool', platform: 'z900'
      [root@9f577cbc5cf4 03874568]# gdb $(which ceph-bluestore-tool) storage3-ceph-bluestore-core-dump
      Reading symbols from /usr/bin/ceph-bluestore-tool...Reading symbols from /usr/lib/debug/usr/bin/ceph-bluestore-tool-16.2.10-248.el8cp.s390x.debug...done.
      done.

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: core file may not match specified executable file.
      [New LWP 275]
      Core was generated by `ceph-bluestore-tool prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /'.
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0 0x000003ffa39bca00 in ?? ()
      (gdb) bt full
      Python Exception <class 'gdb.error'> PC not saved:
      #0 0x000003ffa39bca00 in ?? ()
      No symbol table info available.
      (gdb) list
      212
      213     void inferring_bluefs_devices(vector<string>& devs, std::string& path)
      214     {
      215       cout << "inferring bluefs devices from bluestore path" << std::endl;
      216       for (auto fn : {"block", "block.wal", "block.db"}) {
      217         string p = path + "/" + fn;
      218         struct stat st;
      219         if (::stat(p.c_str(), &st) == 0) {
      220           devs.push_back(p);
      221         }

      ---
      [root@9f577cbc5cf4 03874568]# exit
      exit
      [root@s390x-kvm-056 coredumps]# yum list installed | grep ceph
      ceph-base.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-base-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-common.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-common-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-debugsource.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-fuse.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-fuse-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-immutable-object-cache.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-immutable-object-cache-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-mds.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-mds-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-mgr-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-mgr-modules-core.noarch 2:18.2.0-1.fc39 @@commandline
      ceph-mon.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-mon-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-osd.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-osd-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-radosgw.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-radosgw-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-resource-agents.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-selinux.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-test.s390x 2:16.2.10-248.el8cp @@commandline
      ceph-test-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      cephfs-mirror.s390x 2:16.2.10-248.el8cp @@commandline
      cephfs-mirror-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      libcephfs-devel.s390x 2:16.2.10-248.el8cp @@commandline
      libcephfs2.s390x 2:16.2.10-248.el8cp @@commandline
      libcephfs2-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      libcephsqlite.s390x 2:16.2.10-248.el8cp @@commandline
      libcephsqlite-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      libcephsqlite-devel.s390x 2:16.2.10-248.el8cp @@commandline
      python3-ceph-argparse.s390x 2:16.2.10-248.el8cp @@commandline
      python3-ceph-common.s390x 2:16.2.10-248.el8cp @@commandline
      python3-cephfs.s390x 2:16.2.10-248.el8cp @@commandline
      python3-cephfs-debuginfo.s390x 2:16.2.10-248.el8cp @@commandline
      [root@s390x-kvm-056 03874568]# gdb /usr/bin/ceph-bluestore-tool storage3-ceph-bluestore-core-dump
      Reading symbols from /usr/bin/ceph-bluestore-tool...Reading symbols from /usr/lib/debug/usr/bin/ceph-bluestore-tool-16.2.10-248.el8cp.s390x.debug...done.
      done.

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: Can't open file (null) during file-backed mapping note processing

      warning: core file may not match specified executable file.
      [New LWP 275]
      Core was generated by `ceph-bluestore-tool prime-osd-dir --dev /var/lib/ceph/osd/ceph-1/block --path /'.
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0 0x000003ffa39bca00 in ?? ()
      (gdb) bt
      Python Exception <class 'gdb.error'> PC not saved:
      #0 0x000003ffa39bca00 in ?? ()
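
      Since gdb cannot unwind past frame #0 ("PC not saved") and has no symbols for the faulting address, a possible next step is sketched below: check which mapping the saved PC falls into and disassemble around it. This is a sketch, not a verified procedure; the glibc-debuginfo package name and the availability of debuginfo repositories inside the container are assumptions, and it should be run inside the same rhceph container so the library paths match the core.

      # Symbols for ld-2.28.so frames (assumes a reachable debuginfo repository)
      yum install -y glibc-debuginfo

      # Show the mappings recorded in the core, then disassemble around the saved PC and retry the backtrace
      gdb -batch \
          -ex 'info proc mappings' \
          -ex 'info sharedlibrary' \
          -ex 'x/8i 0x000003ffa39bca00' \
          -ex 'bt' \
          /usr/bin/ceph-bluestore-tool storage3-ceph-bluestore-core-dump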

      Version of all relevant components (if applicable):

      OCP Version 4.12.53
      ODF Version 4.12.14
      Node OS: Linux storage3.ocpmfdc0p01.enterprise.wistate.us 4.18.0-372.96.1.el8_6.s390x #1 SMP Mon Mar 4 22:41:25 EST 2024 s390x s390x s390x GNU/Linux

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)? Yes, all OSDs are in CLBO. No workloads are able to run.

      Is there any workaround available to the best of your knowledge? No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)? 4

      Is this issue reproducible? Not that I am aware of

      Can this issue be reproduced from the UI? Not that I am aware of

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Expected results:

      Additional info:
      Relevant attachments are in supportshell:
      - sosreports for storage1 and storage3
      - coredumps from all nodes
      - ODF and OCP must-gather

      I currently have an s390x architecture lab running; please let me know if there is any assistance I can provide.

              tstober@redhat.com Thomas Stober
              rhn-support-rlaberin Ryan Laberinto
              Thomas Stober
              Elad Ben Aharon