Data Foundation Bugs / DFBUGS-847

[RDR] Ceph-osd crashed with reason Message::encode(unsigned long, int, bool)+0x2e after upgrading Ceph from 18.2.1-229.el9cp to 19.2.0-47.el9cp

    • Bug
    • Resolution: Unresolved
    • odf-4.17
    • odf-4.18
    • ceph/RADOS/x86

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      [RDR] Ceph-osd crashed with reason Message::encode(unsigned long, int, bool)+0x2e after upgrading Ceph from 18.2.1-229.el9cp to 19.2.0-47.el9cp

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

       VMware UPI

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

       RDR

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

OCP version: 4.18.0-0.nightly-2024-11-07-215008
ODF version: 4.18.0-49
Ceph version: ceph version 19.2.0-47.el9cp (123a317ae596caa7f6d087fc76fffb6a736e0b5f) squid (stable)
ACM version: 2.12.0
Submariner version: v0.19.0
VolSync version: volsync-product.v0.10.1
OADP version: 1.4.1
VolSync method: destinationCopyMethod: Direct
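
      For reference, a minimal sketch of how these versions can typically be collected on an ODF cluster (assuming the default openshift-storage namespace and the rook-ceph-tools toolbox deployment; these names are assumptions, not taken from this report):

      $ oc get clusterversion version                                      # OCP build
      $ oc get csv -n openshift-storage                                    # ODF operator versions
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions   # running Ceph daemon versions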

       

      Does this issue impact your ability to continue to work with the product?

       

       

      Is there any workaround available to the best of your knowledge?

       

       

      Can this issue be reproduced? If so, please provide the hit rate

       

       

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

1. Deploy a 4.17 RDR cluster

2. Run some workloads

3. Upgrade the cluster to 4.18

4. Check the Ceph status (see the sketch after these steps)
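
      A minimal sketch of the check in step 4, assuming the rook-ceph-tools toolbox deployment in the default openshift-storage namespace (both names are assumptions, not taken from this report):

      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status     # overall health; recent daemon crashes surface as a HEALTH_WARN
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph crash ls   # list crash entries recorded after the upgrade
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions   # confirm all daemons report 19.2.0-47.el9cp

      Any new entry from "ceph crash ls" can then be inspected with "ceph crash info <id>", as shown under Actual results below.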

      The exact date and time when the issue was observed, including timezone details:

       2024-11-12T13:11:54.348893

      Actual results:

        

      $ ceph crash ls
      ID                                                                 ENTITY  NEW
      2024-11-12T13:11:54.348893Z_3b312191-e0b7-4383-a93b-9980e9e08d54  osd.0   *

       

      $ ceph crash info 2024-11-12T13:11:54.348893Z_3b312191-e0b7-4383-a93b-9980e9e08d54
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x3e6f0) [0x7f44605fd6f0]",
              "/lib64/libc.so.6(+0x8b94c) [0x7f446064a94c]",
              "raise()",
              "abort()",
              "/lib64/libc.so.6(+0x2871b) [0x7f44605e771b]",
              "/lib64/libc.so.6(+0x37386) [0x7f44605f6386]",
              "ceph-osd(+0x8ca478) [0x56359cd6a478]",
              "(Message::encode(unsigned long, int, bool)+0x2e) [0x56359d09275e]",
              "(ProtocolV2::send_message(Message*)+0xc9) [0x56359d24a529]",
              "(AsyncConnection::send_message(Message*)+0x276) [0x56359d234396]",
              "(OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1cc) [0x56359c9df4dc]",
              "(ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x79d) [0x56359cd7f3cd]",
              "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x652) [0x56359cd7fbd2]",
              "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x392) [0x56359cb46202]",
              "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x59) [0x56359cb4a529]",
              "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xcd0) [0x56359cb4de90]",
              "ceph-osd(+0x5eb53e) [0x56359ca8b53e]",
              "(CommonSafeTimer<std::mutex>::timer_thread()+0x12a) [0x56359cf64c1a]",
              "ceph-osd(+0xac55b1) [0x56359cf655b1]",
              "/lib64/libc.so.6(+0x89c02) [0x7f4460648c02]",
              "/lib64/libc.so.6(+0x10ec40) [0x7f44606cdc40]"
          ],
          "ceph_version": "19.2.0-47.el9cp",
          "crash_id": "2024-11-12T13:11:54.348893Z_3b312191-e0b7-4383-a93b-9980e9e08d54",
          "entity_name": "osd.0",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.4 (Plow)",
          "os_version_id": "9.4",
          "process_name": "ceph-osd",
          "stack_sig": "338c287e4eeaae0ee1893eb6f465b526af91f5ddc6f2a6689fe7b2e8097cd083",
          "timestamp": "2024-11-12T13:11:54.348893Z",
          "utsname_hostname": "rook-ceph-osd-0-bc49d7f7f-hsjsz",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.44.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Nov 1 14:40:56 EDT 2024"
      }
      

      Expected results:

       There should not be any ceph-osd crash after the upgrade.

      Logs collected and log location:

       http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/sync_issue/
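
      In addition, a hedged sketch of how further data for the crashing OSD could be gathered (the pod name is taken from utsname_hostname in the crash info above; the must-gather image is a placeholder, not a confirmed value):

      $ oc logs -n openshift-storage rook-ceph-osd-0-bc49d7f7f-hsjsz --previous   # log of the crashed OSD container
      $ oc adm must-gather --image=<odf-must-gather-image>                        # full ODF must-gather for the cluster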

      Additional info:

       

              rzarzyns@redhat.com Radoslaw Zarzynski
              prsurve@redhat.com Pratik Surve