Data Foundation Bugs / DFBUGS-847

[RDR] Ceph-osd crashed with reason Message::encode(unsigned long, int, bool)+0x2e after upgrading Ceph from 18.2.1-229.el9cp to 19.2.0-47.el9cp

    • Bug
    • Resolution: Unresolved
    • odf-4.17
    • odf-4.18
    • ceph/RADOS/x86

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      [RDR] Ceph-osd crashed with reason Message::encode(unsigned long, int, bool)+0x2e after upgrading Ceph from 18.2.1-229.el9cp to 19.2.0-47.el9cp

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

       VMware UPI

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

       RDR

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

OCP version: 4.18.0-0.nightly-2024-11-07-215008
ODF version: 4.18.0-49
Ceph version: ceph version 19.2.0-47.el9cp (123a317ae596caa7f6d087fc76fffb6a736e0b5f) squid (stable)
ACM version: 2.12.0
Submariner version: v0.19.0
VolSync version: volsync-product.v0.10.1
OADP version: 1.4.1
VolSync method: destinationCopyMethod: Direct
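
      For reference, a minimal sketch of how these versions can typically be collected on an ODF cluster (assuming the default openshift-storage namespace and the rook-ceph-tools toolbox deployment; these names are assumptions, not taken from this report):

      $ oc get clusterversion version                                      # OCP build
      $ oc get csv -n openshift-storage                                    # ODF operator versions
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions   # running Ceph daemon versions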

       

      Does this issue impact your ability to continue to work with the product?

       

       

      Is there any workaround available to the best of your knowledge?

       

       

      Can this issue be reproduced? If so, please provide the hit rate

       

       

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

1. Deploy a 4.17 RDR cluster

2. Run some workloads

3. Upgrade the cluster to 4.18

4. Check the Ceph status (see the sketch after these steps)
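
      A minimal sketch of the check in step 4, assuming the rook-ceph-tools toolbox deployment in the default openshift-storage namespace (both names are assumptions, not taken from this report):

      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status     # overall health; recent daemon crashes surface as a HEALTH_WARN
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph crash ls   # list crash entries recorded after the upgrade
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions   # confirm all daemons report 19.2.0-47.el9cp

      Any new entry from "ceph crash ls" can then be inspected with "ceph crash info <id>", as shown under Actual results below.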

      The exact date and time when the issue was observed, including timezone details:

       2024-11-12T13:11:54.348893

      Actual results:

        

      $ ceph crash ls
      ID                                                                 ENTITY  NEW
      2024-11-12T13:11:54.348893Z_3b312191-e0b7-4383-a93b-9980e9e08d54  osd.0   *

       

      $ ceph crash info 2024-11-12T13:11:54.348893Z_3b312191-e0b7-4383-a93b-9980e9e08d54
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x3e6f0) [0x7f44605fd6f0]",
              "/lib64/libc.so.6(+0x8b94c) [0x7f446064a94c]",
              "raise()",
              "abort()",
              "/lib64/libc.so.6(+0x2871b) [0x7f44605e771b]",
              "/lib64/libc.so.6(+0x37386) [0x7f44605f6386]",
              "ceph-osd(+0x8ca478) [0x56359cd6a478]",
              "(Message::encode(unsigned long, int, bool)+0x2e) [0x56359d09275e]",
              "(ProtocolV2::send_message(Message*)+0xc9) [0x56359d24a529]",
              "(AsyncConnection::send_message(Message*)+0x276) [0x56359d234396]",
              "(OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1cc) [0x56359c9df4dc]",
              "(ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x79d) [0x56359cd7f3cd]",
              "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x652) [0x56359cd7fbd2]",
              "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x392) [0x56359cb46202]",
              "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x59) [0x56359cb4a529]",
              "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xcd0) [0x56359cb4de90]",
              "ceph-osd(+0x5eb53e) [0x56359ca8b53e]",
              "(CommonSafeTimer<std::mutex>::timer_thread()+0x12a) [0x56359cf64c1a]",
              "ceph-osd(+0xac55b1) [0x56359cf655b1]",
              "/lib64/libc.so.6(+0x89c02) [0x7f4460648c02]",
              "/lib64/libc.so.6(+0x10ec40) [0x7f44606cdc40]"
          ],
          "ceph_version": "19.2.0-47.el9cp",
          "crash_id": "2024-11-12T13:11:54.348893Z_3b312191-e0b7-4383-a93b-9980e9e08d54",
          "entity_name": "osd.0",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.4 (Plow)",
          "os_version_id": "9.4",
          "process_name": "ceph-osd",
          "stack_sig": "338c287e4eeaae0ee1893eb6f465b526af91f5ddc6f2a6689fe7b2e8097cd083",
          "timestamp": "2024-11-12T13:11:54.348893Z",
          "utsname_hostname": "rook-ceph-osd-0-bc49d7f7f-hsjsz",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-427.44.1.el9_4.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Nov 1 14:40:56 EDT 2024"
      }
      

      Expected results:

       There should not be any ceph-osd crash after the upgrade.

      Logs collected and log location:

       http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/sync_issue/
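
      In addition, a hedged sketch of how further data for the crashing OSD could be gathered (the pod name is taken from utsname_hostname in the crash info above; the must-gather image is a placeholder, not a confirmed value):

      $ oc logs -n openshift-storage rook-ceph-osd-0-bc49d7f7f-hsjsz --previous   # log of the crashed OSD container
      $ oc adm must-gather --image=<odf-must-gather-image>                        # full ODF must-gather for the cluster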

      Additional info:

       

              rzarzyns@redhat.com Radoslaw Zarzynski
              prsurve@redhat.com Pratik Surve