Data Foundation Bugs / DFBUGS-698

[2297267] Ceph mon aborted in thread_name:msgr-worker-1


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.14
    • Component: ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):
      ==================================================================================
      Ceph mon (mon.b) aborted in thread_name:msgr-worker-1 while performing repeated OCP worker machine config pool reboots.

      sh-5.1$ ceph crash ls
      ID                                                                ENTITY  NEW
      2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23  mon.b   *
      Log snip:
      ==========

      -8> 2024-07-11T05:01:39.288+0000 7f77c3992900 5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
      -7> 2024-07-11T05:01:39.288+0000 7f77c3992900 5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
      -6> 2024-07-11T05:01:39.288+0000 7f77c3992900 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
      -5> 2024-07-11T05:01:39.288+0000 7f77c3992900 2 mon.b@-1(???) e3 init
      -4> 2024-07-11T05:01:39.290+0000 7f77c3992900 4 mgrc handle_mgr_map Got map version 928
      -3> 2024-07-11T05:01:39.291+0000 7f77c3992900 4 mgrc handle_mgr_map Active mgr is now [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
      -2> 2024-07-11T05:01:39.291+0000 7f77c3992900 4 mgrc reconnect Starting new session with [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
      -1> 2024-07-11T05:01:39.315+0000 7f77c3992900 0 mon.b@-1(probing) e3 my rank is now 1 (was -1)
      0> 2024-07-11T05:01:39.339+0000 7f77baf63640 -1 *** Caught signal (Aborted) **
      in thread 7f77baf63640 thread_name:msgr-worker-1

      ceph version 17.2.6-216.0.hotfix.bz2266538.el9cp (e3968f91dc6b6b52eea5a64d169887c551d0d99c) quincy (stable)
      1: /lib64/libc.so.6(+0x3e6f0) [0x7f77c3f336f0]
      2: /lib64/libc.so.6(+0x8b94c) [0x7f77c3f8094c]
      3: raise()
      4: abort()
      5: /lib64/libstdc++.so.6(+0xa1b21) [0x7f77c4295b21]
      6: /lib64/libstdc++.so.6(+0xad52c) [0x7f77c42a152c]
      7: /lib64/libstdc++.so.6(+0xad597) [0x7f77c42a1597]
      8: /lib64/libstdc++.so.6(+0xad7f9) [0x7f77c42a17f9]
      9: /usr/lib64/ceph/libceph-common.so.2(+0x137e4b) [0x7f77c4813e4b]
      10: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x613) [0x7f77c4acdb83]
      11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x39) [0x7f77c4ab97a9]
      12: (AsyncConnection::process()+0x42b) [0x7f77c4a99e7b]
      13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1c1) [0x7f77c4ae1401]
      14: /usr/lib64/ceph/libceph-common.so.2(+0x405eb6) [0x7f77c4ae1eb6]
      15: /lib64/libstdc++.so.6(+0xdbad4) [0x7f77c42cfad4]
      16: /lib64/libc.so.6(+0x89c02) [0x7f77c3f7ec02]
      17: /lib64/libc.so.6(+0x10ec40) [0x7f77c4003c40]
      NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

      --- logging levels ---
      0/ 5 none
      0/ 1 lockdep
      0/ 1 context
      1/ 1 crush
      1/ 5 mds
      1/ 5 mds_balancer
      1/ 5 mds_locker
      1/ 5 mds_log
      1/ 5 mds_log_expire
      1/ 5 mds_migrator
      0/ 1 buffer
      0/ 1 timer
      0/ 1 filer
      0/ 1 striper
      0/ 1 objecter
      0/ 5 rados
      0/ 5 rbd
      0/ 5 rbd_mirror
      0/ 5 rbd_replay
      0/ 5 rbd_pwl
      0/ 5 journaler
      0/ 5 objectcacher
      0/ 5 immutable_obj_cache
      0/ 5 client
      1/ 5 osd
      0/ 5 optracker
      0/ 5 objclass
      1/ 3 filestore
      1/ 3 journal
      0/ 0 ms
      1/ 5 mon
      0/10 monc
      1/ 5 paxos
      0/ 5 tp
      1/ 5 auth
      1/ 5 crypto
      1/ 1 finisher
      1/ 1 reserver
      1/ 5 heartbeatmap
      1/ 5 perfcounter
      1/ 5 rgw
      1/ 5 rgw_sync
      1/ 5 rgw_datacache
      1/10 civetweb
      1/ 5 rgw_access
      1/ 5 javaclient
      1/ 5 asok
      1/ 1 throttle
      0/ 0 refs
      1/ 5 compressor
      1/ 5 bluestore
      1/ 5 bluefs
      1/ 3 bdev
      1/ 5 kstore
      4/ 5 rocksdb
      4/ 5 leveldb
      4/ 5 memdb
      1/ 5 fuse
      2/ 5 mgr
      1/ 5 mgrc
      1/ 5 dpdk
      1/ 5 eventtrace
      1/ 5 prioritycache
      0/ 5 test
      0/ 5 cephfs_mirror
      0/ 5 cephsqlite
      0/ 5 seastore
      0/ 5 seastore_onode
      0/ 5 seastore_odata
      0/ 5 seastore_omap
      0/ 5 seastore_tm
      0/ 5 seastore_cleaner
      0/ 5 seastore_lba
      0/ 5 seastore_cache
      0/ 5 seastore_journal
      0/ 5 seastore_device
      0/ 5 alienstore
      1/ 5 mclock
      1/ 5 ceph_exporter
      -2/-2 (syslog threshold)
      99/99 (stderr threshold)
      --- pthread ID / name mapping for recent threads ---
      7f77b9760640 / ceph-mon
      7f77baf63640 / msgr-worker-1
      7f77c2930640 / admin_socket
      7f77c3992900 / ceph-mon
      max_recent 10000
      max_new 10000

      Version of all relevant components (if applicable):
      OCP: 4.14.31
      ODF: 4.14.9
      Ceph: 17.2.6_216.0.hotfix.bz2266538

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?
      Ceph health went to a warning state due to this mon crash; health can be restored by archiving the crash.
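      As noted above, the health warning can be cleared by archiving the crash. A minimal sketch, run from the rook-ceph toolbox (or any host with admin keyring access), using the crash ID shown in the `ceph crash ls` output above:

      sh-5.1$ ceph crash info 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23
      sh-5.1$ ceph crash archive 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23

      `ceph crash info` prints the stored metadata and backtrace for the crash; `ceph crash archive` marks it as acknowledged so the RECENT_CRASH health warning clears (alternatively, `ceph crash archive-all` archives every new crash at once).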

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      3

      Is this issue reproducible?
      Reporting upon the first occurrence.

      Can this issue be reproduced from the UI?
      N/A

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      ===================
      1) Deploy an OCP + ODF cluster
      2) Reboot the OCP worker machine config pool
      3) Wait for the worker machine config pool to start updating
      4) Wait for the worker machine config pool to finish updating

      Repeat the above steps many times.
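      The reboot loop above could be sketched as follows. This is only an illustration, assuming an oc client recent enough to provide `oc adm reboot-machine-config-pool` (the reboot can also be triggered by rolling out any trivial MachineConfig change); the iteration count and timeouts are arbitrary:

      $ for i in $(seq 1 100); do
          oc adm reboot-machine-config-pool mcp/worker
          oc wait mcp/worker --for=condition=Updating --timeout=10m
          oc wait mcp/worker --for=condition=Updated --timeout=60m
        done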

      Actual results:
      ===============
      mon.b aborted during the 71st iteration.

      Expected results:
      =================
      No crashes should be observed.

              Assignee: Radoslaw Zarzynski (rzarzyns@redhat.com)
              Reporter: Tirumala Satya Prasad Desala (tdesala@redhat.com)
              QA Contact: Elad Ben Aharon
              Votes: 0
              Watchers: 9
