Type: Bug
Resolution: Unresolved
Severity: Critical
Target version: odf-4.14
Description of problem (please be as detailed as possible and provide log
snippets):
==================================================================================
A Ceph OSD (osd.5) aborted in thread 7f71d41a9640 (thread_name:safe_timer) during a cluster-full test. The cluster was filled up to the full_ratio limit of 85%, after which the test failed while attempting to establish a connection to Prometheus. The OSD may have crashed during the teardown part of the test, while the test data was being deleted.
sh-5.1$ ceph crash ls
ID ENTITY NEW
2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270 osd.5 *
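The full metadata for the recorded crash (backtrace, versions, utsname) can be dumped with the crash module; a minimal sketch, using the crash ID listed above:

```shell
# Print the stored metadata for the recorded crash, including the backtrace
ceph crash info 2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270
```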
ceph.osd.5 log snip:
====================
-4> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 648 pg[1.b( v 645'9977 (219'6281,645'9977] local-lis/les=603/604 n=195 ec=32/8 lis/c=603/603 les/c/f=604/604/635 sis=646) [3,2] r=-1 lpr=646 pi=[603,646)/1 crt=645'9977 lcod 645'9976 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-3> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active+clean] exit Started/Primary/Active/Clean 48155.259186 18241 0.193717
-2> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active mbc={255={}}] exit Started/Primary/Active 48379.727367 0 0.000000
-1> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active mbc={255={}}] exit Started/Primary 48380.649031 0 0.000000
0> 2024-07-09T05:50:41.011+0000 7f71d41a9640 -1 *** Caught signal (Aborted) **
in thread 7f71d41a9640 thread_name:safe_timer
ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable)
1: /lib64/libc.so.6(+0x54db0) [0x7f71efe9ddb0]
2: /lib64/libc.so.6(+0xa154c) [0x7f71efeea54c]
3: raise()
4: abort()
5: /lib64/libc.so.6(+0x2871b) [0x7f71efe7171b]
6: /lib64/libc.so.6(+0x4dca6) [0x7f71efe96ca6]
7: ceph-osd(+0x843de8) [0x557239663de8]
8: (Message::encode(unsigned long, int, bool)+0x2e) [0x5572399fc5ae]
9: (ProtocolV2::send_message(Message*)+0x241) [0x557239b83991]
10: (AsyncConnection::send_message(Message*)+0x276) [0x557239b701f6]
11: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1d1) [0x55723934c591]
12: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x772) [0x5572396797b2]
13: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x652) [0x557239672df2]
14: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x38f) [0x55723948223f]
15: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x57) [0x5572394873c7]
16: (PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb73) [0x557239489a93]
17: ceph-osd(+0x5b8d5e) [0x5572393d8d5e]
18: (CommonSafeTimer<std::mutex>::timer_thread()+0x11a) [0x5572398ddb3a]
19: ceph-osd(+0xabe3f1) [0x5572398de3f1]
20: /lib64/libc.so.6(+0x9f802) [0x7f71efee8802]
21: /lib64/libc.so.6(+0x3f450) [0x7f71efe88450]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 rgw_access
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
1/ 5 ceph_exporter
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f71ca195640 / osd_srv_heartbt
7f71ca996640 / tp_osd_tp
7f71cb197640 / tp_osd_tp
7f71cb998640 / tp_osd_tp
7f71cc199640 / tp_osd_tp
7f71cc99a640 / tp_osd_tp
7f71cd19b640 / tp_osd_tp
7f71cd99c640 / tp_osd_tp
7f71ce19d640 / tp_osd_tp
7f71ce99e640 / tp_osd_tp
7f71cf19f640 / tp_osd_tp
7f71cf9a0640 / tp_osd_tp
7f71d01a1640 / tp_osd_tp
7f71d09a2640 / tp_osd_tp
7f71d11a3640 / tp_osd_tp
7f71d19a4640 / tp_osd_tp
7f71d21a5640 / tp_osd_tp
7f71d41a9640 / safe_timer
7f71db1b7640 / ceph-osd
7f71db9b8640 / cfin
7f71dc1b9640 / bstore_kv_sync
7f71dc9ba640 / bstore_kv_final
7f71dd5d8640 / ms_dispatch
7f71de5da640 / safe_timer
7f71df5dc640 / ms_dispatch
7f71e01c1640 / bstore_mempool
7f71e39d5640 / bstore_aio
7f71e59d9640 / safe_timer
7f71ea1e2640 / safe_timer
7f71eb3a2640 / io_context_pool
7f71ecd0a640 / io_context_pool
7f71ed50b640 / admin_socket
7f71edd0c640 / msgr-worker-2
7f71ee50d640 / msgr-worker-1
7f71eed0e640 / msgr-worker-0
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
2024-07-09T05:51:43.199+0000 7f64499295c0 0 set uid:gid to 167:167 (ceph:ceph)
2024-07-09T05:51:43.199+0000 7f64499295c0 0 ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable), process ceph-osd, pid 39716
Version of all relevant components (if applicable):
OCP: 4.14.31
ODF: v4.14.9-rhodf
Ceph: ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable)
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Ceph health went into a warning state due to this OSD crash; the cluster can be returned to a healthy state by archiving the crash.
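Archiving the crash clears the health warning; a sketch of the commands (run from the toolbox pod, with the crash ID reported by `ceph crash ls`):

```shell
# List crashes that have not yet been acknowledged
ceph crash ls-new
# Archive the specific crash so it no longer raises the RECENT_CRASH warning ...
ceph crash archive 2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270
# ... or archive all outstanding crashes at once
# ceph crash archive-all
# Confirm the cluster returns to HEALTH_OK
ceph health detail
```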
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
Unclear; it could be a race condition.
Can this issue reproduce from the UI?
N/A
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
===================
The crash was observed while running - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/cross_functional/system_test/test_cluster_full_and_recovery.py
The test performs the following steps:
1. Create PVC1 [FS + RBD]
2. Verify the new PVC1 [FS + RBD] is in Bound state
3. Run FIO on PVC1_FS + PVC1_RBD
4. Calculate checksums for PVC1_FS + PVC1_RBD
5. Fill the cluster to the full ratio (usually 85%) with benchmark-operator
6. Verify the alerts ["CephClusterCriticallyFull", "CephOSDNearFull"] are seen
7. Respin the 'osd', 'mgr', and 'mon' pods
8. Create PVC2 [FS + RBD]
9. Verify PVC2 [FS + RBD] is in Pending state
10. Create snapshots from PVC1 [FS + RBD]
11. Verify the snapshots are in false (not ready) state
12. Change the Ceph full_ratio from 85% to 95%
13. Delete the benchmark-operator PVCs
14. Verify PVC2 [FS + RBD] moves to Bound state
15. Verify the snapshots move from false to true (ready) state
16. Restore new PVCs from the snapshot PVCs [RBD + FS]
17. Verify the checksum of PVC1 equals PVC1_RESTORE
18. Change the Ceph full_ratio from 95% back to 85%
The test failed at step 6 while establishing a connection to Prometheus.
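Step 5 fills the cluster to the full ratio, and steps 12/18 adjust that ratio so the deletes can proceed. A sketch of both, with the amount to write as simple arithmetic; the capacity and usage numbers are assumptions for illustration, and the Ceph commands are guarded so the sketch is harmless on a machine without `ceph` installed:

```shell
#!/bin/sh
# How much more data must be written to reach full_ratio (assumed numbers):
total_bytes=$((100 * 1024 * 1024 * 1024))   # raw capacity: 100 GiB
used_bytes=$((40 * 1024 * 1024 * 1024))     # already used:  40 GiB
full_ratio_pct=85
to_write=$(( total_bytes * full_ratio_pct / 100 - used_bytes ))
echo "bytes to write to reach full_ratio: $to_write"

# Steps 12 and 18 map to the Ceph CLI (run in the toolbox pod):
if command -v ceph >/dev/null 2>&1; then
    ceph osd set-full-ratio 0.95   # raise the ratio so deletes can proceed
    # ... delete the benchmark-operator PVCs, wait for space to free ...
    ceph osd set-full-ratio 0.85   # restore the default
    ceph osd dump | grep -i full_ratio
fi
```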
Actual results:
===============
One of the Ceph OSDs (osd.5) crashed.
Expected results:
==================
No Ceph crashes.