DFBUGS-280

[2296500] Ceph osd aborted in thread 7f71d41a9640 thread_name:safe_timer


    • Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.14
    • ceph/RADOS/x86
    • Committed
    • Committed

      Description of problem (please be as detailed as possible and provide log snippets):
      ==================================================================================
      Ceph OSD osd.5 aborted in thread 7f71d41a9640 (thread_name:safe_timer) during a cluster-full test. The cluster was filled up to the full_ratio limit of 85%, after which the test failed while attempting to establish a connection to Prometheus. The OSD may have crashed during the teardown phase of the test, while the data was being deleted.
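
      For reference, the cluster fill level and the configured ratios that the test drives the cluster toward can be checked with standard Ceph commands (a minimal sketch, e.g. from the rook-ceph toolbox pod; nothing below is taken from this cluster's logs):

      sh-5.1$ ceph df                        # overall and per-pool usage
      sh-5.1$ ceph osd dump | grep ratio     # full_ratio / backfillfull_ratio / nearfull_ratio
      sh-5.1$ ceph health detail             # OSD_NEARFULL / OSD_FULL warnings once the ratios are hit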

      sh-5.1$ ceph crash ls
      ID ENTITY NEW
      2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270 osd.5 *
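
      The full metadata for the listed crash (backtrace, ceph version, utsname, and so on) can be pulled with ceph crash info; a sketch using the crash ID from the listing above:

      sh-5.1$ ceph crash info 2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270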

      ceph-osd.5 log snippet:
      ====================
      -4> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 648 pg[1.b( v 645'9977 (219'6281,645'9977] local-lis/les=603/604 n=195 ec=32/8 lis/c=603/603 les/c/f=604/604/635 sis=646) [3,2] r=-1 lpr=646 pi=[603,646)/1 crt=645'9977 lcod 645'9976 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
      -3> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active+clean] exit Started/Primary/Active/Clean 48155.259186 18241 0.193717
      -2> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active mbc={255={}}] exit Started/Primary/Active 48379.727367 0 0.000000
      -1> 2024-07-09T05:50:41.769+0000 7f71d09a2640 5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active mbc={255={}}] exit Started/Primary 48380.649031 0 0.000000
      0> 2024-07-09T05:50:41.011+0000 7f71d41a9640 -1 *** Caught signal (Aborted) **
      in thread 7f71d41a9640 thread_name:safe_timer

      ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable)
      1: /lib64/libc.so.6(+0x54db0) [0x7f71efe9ddb0]
      2: /lib64/libc.so.6(+0xa154c) [0x7f71efeea54c]
      3: raise()
      4: abort()
      5: /lib64/libc.so.6(+0x2871b) [0x7f71efe7171b]
      6: /lib64/libc.so.6(+0x4dca6) [0x7f71efe96ca6]
      7: ceph-osd(+0x843de8) [0x557239663de8]
      8: (Message::encode(unsigned long, int, bool)+0x2e) [0x5572399fc5ae]
      9: (ProtocolV2::send_message(Message*)+0x241) [0x557239b83991]
      10: (AsyncConnection::send_message(Message*)+0x276) [0x557239b701f6]
      11: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1d1) [0x55723934c591]
      12: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x772) [0x5572396797b2]
      13: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x652) [0x557239672df2]
      14: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x38f) [0x55723948223f]
      15: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x57) [0x5572394873c7]
      16: (PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb73) [0x557239489a93]
      17: ceph-osd(+0x5b8d5e) [0x5572393d8d5e]
      18: (CommonSafeTimer<std::mutex>::timer_thread()+0x11a) [0x5572398ddb3a]
      19: ceph-osd(+0xabe3f1) [0x5572398de3f1]
      20: /lib64/libc.so.6(+0x9f802) [0x7f71efee8802]
      21: /lib64/libc.so.6(+0x3f450) [0x7f71efe88450]
      NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

      — logging levels —
      0/ 5 none
      0/ 1 lockdep
      0/ 1 context
      1/ 1 crush
      1/ 5 mds
      1/ 5 mds_balancer
      1/ 5 mds_locker
      1/ 5 mds_log
      1/ 5 mds_log_expire
      1/ 5 mds_migrator
      0/ 1 buffer
      0/ 1 timer
      0/ 1 filer
      0/ 1 striper
      0/ 1 objecter
      0/ 5 rados
      0/ 5 rbd
      0/ 5 rbd_mirror
      0/ 5 rbd_replay
      0/ 5 rbd_pwl
      0/ 5 journaler
      0/ 5 objectcacher
      0/ 5 immutable_obj_cache
      0/ 5 client
      1/ 5 osd
      0/ 5 optracker
      0/ 5 objclass
      1/ 3 filestore
      1/ 3 journal
      0/ 0 ms
      1/ 5 mon
      0/10 monc
      1/ 5 paxos
      0/ 5 tp
      1/ 5 auth
      1/ 5 crypto
      1/ 1 finisher
      1/ 1 reserver
      1/ 5 heartbeatmap
      1/ 5 perfcounter
      1/ 5 rgw
      1/ 5 rgw_sync
      1/ 5 rgw_datacache
      1/10 civetweb
      1/ 5 rgw_access
      1/ 5 javaclient
      1/ 5 asok
      1/ 1 throttle
      0/ 0 refs
      1/ 5 compressor
      1/ 5 bluestore
      1/ 5 bluefs
      1/ 3 bdev
      1/ 5 kstore
      4/ 5 rocksdb
      4/ 5 leveldb
      4/ 5 memdb
      1/ 5 fuse
      2/ 5 mgr
      1/ 5 mgrc
      1/ 5 dpdk
      1/ 5 eventtrace
      1/ 5 prioritycache
      0/ 5 test
      0/ 5 cephfs_mirror
      0/ 5 cephsqlite
      0/ 5 seastore
      0/ 5 seastore_onode
      0/ 5 seastore_odata
      0/ 5 seastore_omap
      0/ 5 seastore_tm
      0/ 5 seastore_cleaner
      0/ 5 seastore_lba
      0/ 5 seastore_cache
      0/ 5 seastore_journal
      0/ 5 seastore_device
      0/ 5 alienstore
      1/ 5 mclock
      1/ 5 ceph_exporter
      -2/-2 (syslog threshold)
      99/99 (stderr threshold)
      — pthread ID / name mapping for recent threads —
      7f71ca195640 / osd_srv_heartbt
      7f71ca996640 / tp_osd_tp
      7f71cb197640 / tp_osd_tp
      7f71cb998640 / tp_osd_tp
      7f71cc199640 / tp_osd_tp
      7f71cc99a640 / tp_osd_tp
      7f71cd19b640 / tp_osd_tp
      7f71cd99c640 / tp_osd_tp
      7f71ce19d640 / tp_osd_tp
      7f71ce99e640 / tp_osd_tp
      7f71cf19f640 / tp_osd_tp
      7f71cf9a0640 / tp_osd_tp
      7f71d01a1640 / tp_osd_tp
      7f71d09a2640 / tp_osd_tp
      7f71d11a3640 / tp_osd_tp
      7f71d19a4640 / tp_osd_tp
      7f71d21a5640 / tp_osd_tp
      7f71d41a9640 / safe_timer
      7f71db1b7640 / ceph-osd
      7f71db9b8640 / cfin
      7f71dc1b9640 / bstore_kv_sync
      7f71dc9ba640 / bstore_kv_final
      7f71dd5d8640 / ms_dispatch
      7f71de5da640 / safe_timer
      7f71df5dc640 / ms_dispatch
      7f71e01c1640 / bstore_mempool
      7f71e39d5640 / bstore_aio
      7f71e59d9640 / safe_timer
      7f71ea1e2640 / safe_timer
      7f71eb3a2640 / io_context_pool
      7f71ecd0a640 / io_context_pool
      7f71ed50b640 / admin_socket
      7f71edd0c640 / msgr-worker-2
      7f71ee50d640 / msgr-worker-1
      7f71eed0e640 / msgr-worker-0
      max_recent 10000
      max_new 10000
      log_file /var/log/ceph/ceph-osd.5.log
      — end dump of recent events —
      2024-07-09T05:51:43.199+0000 7f64499295c0 0 set uid:gid to 167:167 (ceph:ceph)
      2024-07-09T05:51:43.199+0000 7f64499295c0 0 ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable), process ceph-osd, pid 39716
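
      As the NOTE in the backtrace suggests, the raw ceph-osd(+0x...) frames can be symbolized once the matching binary (and ideally its debuginfo) for 17.2.6-216.el9cp is available. A rough sketch, assuming the binary sits at /usr/bin/ceph-osd and using frame 7 from the backtrace above:

      sh-5.1$ objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump    # full disassembly with interleaved source, per the NOTE
      sh-5.1$ addr2line -e /usr/bin/ceph-osd -f -C 0x843de8        # resolve a single offset to a function/source line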

      Version of all relevant components (if applicable):
      OCP: 4.14.31
      ODF: v4.14.9-rhodf
      Ceph: ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable)

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      Ceph health went into a WARN state due to this OSD crash; it can be brought back to HEALTH_OK by archiving the crash.
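
      A minimal sketch of clearing the resulting health warning by archiving the crash from the toolbox (this only acknowledges the crash report, it does not address the underlying abort):

      sh-5.1$ ceph crash ls
      sh-5.1$ ceph crash archive 2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270
      sh-5.1$ ceph crash archive-all    # alternatively, archive all new crash reports at once
      sh-5.1$ ceph status               # health returns to HEALTH_OK if no other warnings remain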

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      3

      Is this issue reproducible?
      It could possibly be a race condition.

      Can this issue be reproduced from the UI?
      N/A

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      ===================
      The crash was observed while running https://github.com/red-hat-storage/ocs-ci/blob/master/tests/cross_functional/system_test/test_cluster_full_and_recovery.py
      The test performs the following steps:
      1. Create PVC1 [FS + RBD]
      2. Verify the new PVC1 [FS + RBD] is in Bound state
      3. Run FIO on PVC1_FS + PVC1_RBD
      4. Calculate checksums for PVC1_FS + PVC1_RBD
      5. Fill the cluster to the "full ratio" (usually 85%) with benchmark-operator
      6. Verify the alerts ["CephClusterCriticallyFull", "CephOSDNearFull"] are raised
      7. Respin the 'osd', 'mgr', and 'mon' pods
      8. Create PVC2 [FS + RBD]
      9. Verify PVC2 [FS + RBD] is in Pending state
      10. Create snapshots from PVC1 [FS + RBD]
      11. Verify the snapshots are in false (not ready) state
      12. Change the Ceph full_ratio from 85% to 95% (see the command sketch after this list)
      13. Delete the benchmark-operator PVCs
      14. Verify PVC2 [FS + RBD] moves to Bound state
      15. Verify the snapshots move from false to true state
      16. Restore new PVCs from the snapshots [RBD + FS]
      17. Verify the checksum of PVC1 equals that of PVC1_RESTORE
      18. Change the Ceph full_ratio from 95% back to 85%

      The test failed at step 6 while establishing a connection to Prometheus.
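
      A minimal sketch of the full_ratio adjustment used in steps 12 and 18 (run from the toolbox; 0.85 is the Ceph default full_ratio):

      sh-5.1$ ceph osd set-full-ratio 0.95    # step 12: raise the full ratio so deletes and recovery can proceed
      # ... delete the benchmark-operator PVCs and wait for usage to drop (steps 13-14) ...
      sh-5.1$ ceph osd set-full-ratio 0.85    # step 18: restore the default full ratio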

      Actual results:
      ===============
      One of the Ceph OSDs (osd.5) crashed.

      Expected results:
      ==================
      No Ceph crashes.

              rhn-support-moagrawa Mohit Agrawal (Inactive)
              tdesala@redhat.com Tirumala Satya Prasad Desala
              Nagendra Reddy