Data Foundation Bugs / DFBUGS-698

[2297267] Ceph mon aborted in thread_name:msgr-worker-1


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.14
    • Component: ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log
      snippets):
      ==================================================================================
      Ceph mon (mon.b) aborted in thread_name:msgr-worker-1 while performing repeated OCP worker machine config pool reboots.

      sh-5.1$ ceph crash ls
      ID                                                                ENTITY  NEW
      2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23  mon.b   *
      Log snip:
      ==========

      -8> 2024-07-11T05:01:39.288+0000 7f77c3992900 5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
      -7> 2024-07-11T05:01:39.288+0000 7f77c3992900 5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
      -6> 2024-07-11T05:01:39.288+0000 7f77c3992900 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
      -5> 2024-07-11T05:01:39.288+0000 7f77c3992900 2 mon.b@-1(???) e3 init
      -4> 2024-07-11T05:01:39.290+0000 7f77c3992900 4 mgrc handle_mgr_map Got map version 928
      -3> 2024-07-11T05:01:39.291+0000 7f77c3992900 4 mgrc handle_mgr_map Active mgr is now [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
      -2> 2024-07-11T05:01:39.291+0000 7f77c3992900 4 mgrc reconnect Starting new session with [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
      -1> 2024-07-11T05:01:39.315+0000 7f77c3992900 0 mon.b@-1(probing) e3 my rank is now 1 (was -1)
      0> 2024-07-11T05:01:39.339+0000 7f77baf63640 -1 *** Caught signal (Aborted) **
      in thread 7f77baf63640 thread_name:msgr-worker-1

      ceph version 17.2.6-216.0.hotfix.bz2266538.el9cp (e3968f91dc6b6b52eea5a64d169887c551d0d99c) quincy (stable)
      1: /lib64/libc.so.6(+0x3e6f0) [0x7f77c3f336f0]
      2: /lib64/libc.so.6(+0x8b94c) [0x7f77c3f8094c]
      3: raise()
      4: abort()
      5: /lib64/libstdc++.so.6(+0xa1b21) [0x7f77c4295b21]
      6: /lib64/libstdc++.so.6(+0xad52c) [0x7f77c42a152c]
      7: /lib64/libstdc++.so.6(+0xad597) [0x7f77c42a1597]
      8: /lib64/libstdc++.so.6(+0xad7f9) [0x7f77c42a17f9]
      9: /usr/lib64/ceph/libceph-common.so.2(+0x137e4b) [0x7f77c4813e4b]
      10: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x613) [0x7f77c4acdb83]
      11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x39) [0x7f77c4ab97a9]
      12: (AsyncConnection::process()+0x42b) [0x7f77c4a99e7b]
      13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1c1) [0x7f77c4ae1401]
      14: /usr/lib64/ceph/libceph-common.so.2(+0x405eb6) [0x7f77c4ae1eb6]
      15: /lib64/libstdc++.so.6(+0xdbad4) [0x7f77c42cfad4]
      16: /lib64/libc.so.6(+0x89c02) [0x7f77c3f7ec02]
      17: /lib64/libc.so.6(+0x10ec40) [0x7f77c4003c40]
      NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

      --- logging levels ---
      0/ 5 none
      0/ 1 lockdep
      0/ 1 context
      1/ 1 crush
      1/ 5 mds
      1/ 5 mds_balancer
      1/ 5 mds_locker
      1/ 5 mds_log
      1/ 5 mds_log_expire
      1/ 5 mds_migrator
      0/ 1 buffer
      0/ 1 timer
      0/ 1 filer
      0/ 1 striper
      0/ 1 objecter
      0/ 5 rados
      0/ 5 rbd
      0/ 5 rbd_mirror
      0/ 5 rbd_replay
      0/ 5 rbd_pwl
      0/ 5 journaler
      0/ 5 objectcacher
      0/ 5 immutable_obj_cache
      0/ 5 client
      1/ 5 osd
      0/ 5 optracker
      0/ 5 objclass
      1/ 3 filestore
      1/ 3 journal
      0/ 0 ms
      1/ 5 mon
      0/10 monc
      1/ 5 paxos
      0/ 5 tp
      1/ 5 auth
      1/ 5 crypto
      1/ 1 finisher
      1/ 1 reserver
      1/ 5 heartbeatmap
      1/ 5 perfcounter
      1/ 5 rgw
      1/ 5 rgw_sync
      1/ 5 rgw_datacache
      1/10 civetweb
      1/ 5 rgw_access
      1/ 5 javaclient
      1/ 5 asok
      1/ 1 throttle
      0/ 0 refs
      1/ 5 compressor
      1/ 5 bluestore
      1/ 5 bluefs
      1/ 3 bdev
      1/ 5 kstore
      4/ 5 rocksdb
      4/ 5 leveldb
      4/ 5 memdb
      1/ 5 fuse
      2/ 5 mgr
      1/ 5 mgrc
      1/ 5 dpdk
      1/ 5 eventtrace
      1/ 5 prioritycache
      0/ 5 test
      0/ 5 cephfs_mirror
      0/ 5 cephsqlite
      0/ 5 seastore
      0/ 5 seastore_onode
      0/ 5 seastore_odata
      0/ 5 seastore_omap
      0/ 5 seastore_tm
      0/ 5 seastore_cleaner
      0/ 5 seastore_lba
      0/ 5 seastore_cache
      0/ 5 seastore_journal
      0/ 5 seastore_device
      0/ 5 alienstore
      1/ 5 mclock
      1/ 5 ceph_exporter
      -2/-2 (syslog threshold)
      99/99 (stderr threshold)
      --- pthread ID / name mapping for recent threads ---
      7f77b9760640 / ceph-mon
      7f77baf63640 / msgr-worker-1
      7f77c2930640 / admin_socket
      7f77c3992900 / ceph-mon
      max_recent 10000
      max_new 10000

      Version of all relevant components (if applicable):
      OCP: 4.14.31
      ODF: 4.14.9
      Ceph: 17.2.6_216.0.hotfix.bz2266538

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?
      Ceph health went to a warning state due to this mon crash; health can be restored by archiving the crash.
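      As noted above, the health warning can be cleared by archiving the crash. A minimal sketch, run from the rook-ceph toolbox (or any host with admin keyring access), using the crash ID shown in the `ceph crash ls` output above:

      sh-5.1$ ceph crash info 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23
      sh-5.1$ ceph crash archive 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23

      `ceph crash info` prints the stored metadata and backtrace for the crash; `ceph crash archive` marks it as acknowledged so the RECENT_CRASH health warning clears (alternatively, `ceph crash archive-all` archives every new crash at once).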

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      3

      Is this issue reproducible?
      Reporting upon the first occurrence.

      Can this issue be reproduced from the UI?
      N/A

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      ===================
      1) Deploy an OCP + ODF cluster
      2) Reboot the OCP worker machine config pool
      3) Wait for the worker machine config pool to start updating
      4) Wait for the worker machine config pool to finish updating

      Repeat the above steps many times.
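      The reboot loop above could be sketched as follows. This is only an illustration, assuming an oc client recent enough to provide `oc adm reboot-machine-config-pool` (the reboot can also be triggered by rolling out any trivial MachineConfig change); the iteration count and timeouts are arbitrary:

      $ for i in $(seq 1 100); do
          oc adm reboot-machine-config-pool mcp/worker
          oc wait mcp/worker --for=condition=Updating --timeout=10m
          oc wait mcp/worker --for=condition=Updated --timeout=60m
        done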

      Actual results:
      ===============
      mon.b aborted during the 71st iteration.

      Expected results:
      =================
      No crashes should be observed.

              Assignee: Radoslaw Zarzynski (rzarzyns@redhat.com)
              Reporter: Tirumala Satya Prasad Desala (tdesala@redhat.com)
              QA Contact: Elad Ben Aharon
              Votes: 0
              Watchers: 9
