- Bug
- Resolution: Unresolved
- Critical
- odf-4.13
- None
This bug was initially created as a copy of Bug #2249844
I am copying this bug because:
Description of problem (please be as detailed as possible and provide log
snippets):
After upgrading from 4.12 to 4.13.5-6 (both OCP and ODF were upgraded),
we see a Ceph HEALTH_WARN:
sh-5.1$ ceph status
  cluster:
    id:     68dc565f-f700-4312-93be-265b7ed15941
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 78m)
    mgr: a(active, since 77m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 77m), 3 in (since 2h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 1.05k objects, 2.0 GiB
    usage:   5.9 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     185 active+clean

  io:
    client: 1.4 KiB/s rd, 134 KiB/s wr, 2 op/s rd, 2 op/s wr
sh-5.1$ ceph crash ls
ID                                                                ENTITY  NEW
2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a  mgr.a   *

sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f7c91f2bdf0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f7c91f7854c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f7c92279a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f7c9228537c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f7c922853e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f7c92285649]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x170d39) [0x7f7c9256fd39]",
        "(SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3b) [0x7f7c926a7f4b]",
        "/lib64/libcephfs.so.2(+0xaaec7) [0x7f7c86c43ec7]",
        "/lib64/libcephfs.so.2(+0xacc59) [0x7f7c86c45c59]",
        "/lib64/libcephfs.so.2(+0xadf10) [0x7f7c86c46f10]",
        "/lib64/libcephfs.so.2(+0x929e8) [0x7f7c86c2b9e8]",
        "(DispatchQueue::entry()+0x53a) [0x7f7c9272defa]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x3bab31) [0x7f7c927b9b31]",
        "/lib64/libc.so.6(+0x9f802) [0x7f7c91f76802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f7c91f16450]"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a",
    "entity_name": "mgr.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "4cb0911c06087a31d9752535de90ba18fd7aab25c037945b2c61f584dcf6a6db",
    "timestamp": "2023-11-15T08:10:44.427601Z",
    "utsname_hostname": "rook-ceph-mgr-a-5d475468dd-wzhmt",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
}
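For quick triage, the backtrace alone can be pulled out of the crash report (a convenience sketch; it assumes jq is available in the toolbox pod):

sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a | jq -r '.backtrace[]'   # one frame per line

The frames show the abort originating in SnapRealmInfo::decode() inside libcephfs as loaded by the ceph-mgr process, which is what points at the tracker issue linked below.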
Discussed here:
https://chat.google.com/room/AAAAREGEba8/fZvCCW1MQfU
Venky pointed out that it smells like this issue:
https://tracker.ceph.com/issues/63188
BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2247174
Venky cloned the 7.0 BZ to 6.1z4 target - https://bugzilla.redhat.com/show_bug.cgi?id=2249814
Version of all relevant components (if applicable):
ODF 4.13.5-6
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Is there any workaround available to the best of your knowledge?
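A possible mitigation (an assumption based on general Ceph crash-module behavior, not verified against this bug): archiving the crash report clears the "1 daemons have recently crashed" warning, although it does not address the underlying mgr crash:

sh-5.1$ ceph crash archive 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
sh-5.1$ ceph crash archive-all    # alternatively, archive all new crash reports at once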
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Trying to reproduce here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/32/
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Install ODF 4.12 and OCP 4.12
2. Upgrade OCP to 4.13
3. Upgrade ODF to the 4.13.5-6 build
4. After some time, the HEALTH_WARN appears (see the check sketched after this list)
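A minimal way to watch for the warning in step 4, assuming the rook-ceph toolbox is enabled and exposed through the usual rook-ceph-tools deployment in openshift-storage (both names are assumptions, matching ODF defaults):

# run ceph commands through the toolbox deployment
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls-new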
Actual results:
Ceph reports HEALTH_WARN: 1 daemons have recently crashed (mgr.a)
Expected results:
No health warning; ceph status reports HEALTH_OK
Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-031vue1cslv33-uba/j-031vue1cslv33-uba_20231115T053551/logs/testcases_1700036781/j-031vue1cslv33-u/
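(For reference, a must-gather like the one above is typically collected with oc adm must-gather; the exact ODF image reference below is an assumption and should be matched to the installed version:)

oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13   # image tag assumed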
Job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/31/