Bug
Resolution: Unresolved
Critical
odf-4.14
None
Description of problem (please be as detailed as possible and provide log
snippets):
==================================================================================
Ceph mon (mon.b) aborted in thread msgr-worker-1 during repeated reboots of the OCP worker machine config pool.
sh-5.1$ ceph crash ls
ID ENTITY NEW
2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23 mon.b *
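For reference, the full metadata and backtrace for this crash can be dumped with the standard ceph crash subcommand, using the ID reported above:
sh-5.1$ ceph crash info 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23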
Log snip:
==========
-8> 2024-07-11T05:01:39.288+0000 7f77c3992900 5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
-7> 2024-07-11T05:01:39.288+0000 7f77c3992900 5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
-6> 2024-07-11T05:01:39.288+0000 7f77c3992900 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
-5> 2024-07-11T05:01:39.288+0000 7f77c3992900 2 mon.b@-1(???) e3 init
-4> 2024-07-11T05:01:39.290+0000 7f77c3992900 4 mgrc handle_mgr_map Got map version 928
-3> 2024-07-11T05:01:39.291+0000 7f77c3992900 4 mgrc handle_mgr_map Active mgr is now [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
-2> 2024-07-11T05:01:39.291+0000 7f77c3992900 4 mgrc reconnect Starting new session with [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
-1> 2024-07-11T05:01:39.315+0000 7f77c3992900 0 mon.b@-1(probing) e3 my rank is now 1 (was -1)
0> 2024-07-11T05:01:39.339+0000 7f77baf63640 -1 *** Caught signal (Aborted) **
in thread 7f77baf63640 thread_name:msgr-worker-1
ceph version 17.2.6-216.0.hotfix.bz2266538.el9cp (e3968f91dc6b6b52eea5a64d169887c551d0d99c) quincy (stable)
1: /lib64/libc.so.6(+0x3e6f0) [0x7f77c3f336f0]
2: /lib64/libc.so.6(+0x8b94c) [0x7f77c3f8094c]
3: raise()
4: abort()
5: /lib64/libstdc++.so.6(+0xa1b21) [0x7f77c4295b21]
6: /lib64/libstdc++.so.6(+0xad52c) [0x7f77c42a152c]
7: /lib64/libstdc++.so.6(+0xad597) [0x7f77c42a1597]
8: /lib64/libstdc++.so.6(+0xad7f9) [0x7f77c42a17f9]
9: /usr/lib64/ceph/libceph-common.so.2(+0x137e4b) [0x7f77c4813e4b]
10: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x613) [0x7f77c4acdb83]
11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x39) [0x7f77c4ab97a9]
12: (AsyncConnection::process()+0x42b) [0x7f77c4a99e7b]
13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1c1) [0x7f77c4ae1401]
14: /usr/lib64/ceph/libceph-common.so.2(+0x405eb6) [0x7f77c4ae1eb6]
15: /lib64/libstdc++.so.6(+0xdbad4) [0x7f77c42cfad4]
16: /lib64/libc.so.6(+0x89c02) [0x7f77c3f7ec02]
17: /lib64/libc.so.6(+0x10ec40) [0x7f77c4003c40]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 rgw_access
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
1/ 5 ceph_exporter
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f77b9760640 / ceph-mon
7f77baf63640 / msgr-worker-1
7f77c2930640 / admin_socket
7f77c3992900 / ceph-mon
max_recent 10000
max_new 10000
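Per the NOTE in the dump above, interpreting the raw frame addresses requires the matching binary. A minimal sketch, assuming objdump is available inside the mon container and using the library path shown in the backtrace:
sh-5.1$ objdump -rdS /usr/lib64/ceph/libceph-common.so.2 > libceph-common.disasm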
Version of all relevant components (if applicable):
OCP: 4.14.31
ODF: 4.14.9
Ceph: 17.2.6_216.0.hotfix.bz2266538
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Ceph health went into WARN state due to this mon crash; it can be returned to HEALTH_OK by archiving the crash.
Is there any workaround available to the best of your knowledge?
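Yes; as noted above, archiving the crash clears the health warning. For example, from the toolbox (standard ceph crash subcommand, using the ID reported above):
sh-5.1$ ceph crash archive 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23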
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Can this issue be reproduced?
Reporting upon the first occurrence.
Can this issue be reproduced from the UI?
N/A
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
===================
1) Deploy an OCP + ODF cluster
2) Reboot the OCP worker machine config pool
3) Wait for the worker machine config pool to start updating
4) Wait for the worker machine config pool to stop updating
Repeat steps 2-4 many times (see the loop sketch below).
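A minimal shell sketch of the reboot loop, assuming the installed oc client provides oc adm reboot-machine-config-pool (available in recent oc releases) and that mcp/worker is the pool being cycled; the iteration count is illustrative:
for i in $(seq 1 100); do
    # Trigger a rolling reboot of the worker machine config pool.
    oc adm reboot-machine-config-pool mcp/worker
    # Wait for the pool to start updating, then to finish updating.
    oc wait mcp/worker --for=condition=Updating --timeout=15m
    oc wait mcp/worker --for=condition=Updated --timeout=60m
done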
Actual results:
===============
mon.b aborted during the 71st iteration.
Expected results:
=================
No crashes should be observed.