Data Foundation Bugs / DFBUGS-145

[2250227] [ODF-4.13.z][CEPH bug 2249814 tracker] Health Warn after upgrade to 4.13.5-6 - 1 daemons have recently crashed


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.13.13
    • odf-4.13
    • ceph/CephFS/x86

      This bug was initially created as a copy of Bug #2249844

      I am copying this bug because:

      Description of problem (please be as detailed as possible and provide log
      snippets):
      After upgrading from 4.12 to 4.13.5-6 (both the OCP and ODF upgrades), we
      see a Ceph health warning:

      sh-5.1$ ceph status
        cluster:
          id:     68dc565f-f700-4312-93be-265b7ed15941
          health: HEALTH_WARN
                  1 daemons have recently crashed

        services:
          mon: 3 daemons, quorum a,b,c (age 78m)
          mgr: a(active, since 77m)
          mds: 1/1 daemons up, 1 hot standby
          osd: 3 osds: 3 up (since 77m), 3 in (since 2h)
          rgw: 1 daemon active (1 hosts, 1 zones)

        data:
          volumes: 1/1 healthy
          pools:   12 pools, 185 pgs
          objects: 1.05k objects, 2.0 GiB
          usage:   5.9 GiB used, 1.5 TiB / 1.5 TiB avail
          pgs:     185 active+clean

        io:
          client: 1.4 KiB/s rd, 134 KiB/s wr, 2 op/s rd, 2 op/s wr
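
      (The ceph commands in this report are run from the rook-ceph toolbox pod. A minimal
      sketch of getting that shell, assuming the toolbox is enabled through the OCSInitialization
      CR and the deployment is named rook-ceph-tools in openshift-storage; names may differ per setup:)

      $ oc patch ocsinitialization ocsinit -n openshift-storage --type json \
          --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'
      $ oc rsh -n openshift-storage deploy/rook-ceph-tools
      sh-5.1$ ceph health detail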

      sh-5.1$ ceph crash ls
      ID                                                                 ENTITY  NEW
      2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a   mgr.a    *
      sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
      {
          "backtrace": [
              "/lib64/libc.so.6(+0x54df0) [0x7f7c91f2bdf0]",
              "/lib64/libc.so.6(+0xa154c) [0x7f7c91f7854c]",
              "raise()",
              "abort()",
              "/lib64/libstdc++.so.6(+0xa1a01) [0x7f7c92279a01]",
              "/lib64/libstdc++.so.6(+0xad37c) [0x7f7c9228537c]",
              "/lib64/libstdc++.so.6(+0xad3e7) [0x7f7c922853e7]",
              "/lib64/libstdc++.so.6(+0xad649) [0x7f7c92285649]",
              "/usr/lib64/ceph/libceph-common.so.2(+0x170d39) [0x7f7c9256fd39]",
              "(SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3b) [0x7f7c926a7f4b]",
              "/lib64/libcephfs.so.2(+0xaaec7) [0x7f7c86c43ec7]",
              "/lib64/libcephfs.so.2(+0xacc59) [0x7f7c86c45c59]",
              "/lib64/libcephfs.so.2(+0xadf10) [0x7f7c86c46f10]",
              "/lib64/libcephfs.so.2(+0x929e8) [0x7f7c86c2b9e8]",
              "(DispatchQueue::entry()+0x53a) [0x7f7c9272defa]",
              "/usr/lib64/ceph/libceph-common.so.2(+0x3bab31) [0x7f7c927b9b31]",
              "/lib64/libc.so.6(+0x9f802) [0x7f7c91f76802]",
              "/lib64/libc.so.6(+0x3f450) [0x7f7c91f16450]"
          ],
          "ceph_version": "17.2.6-148.el9cp",
          "crash_id": "2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a",
          "entity_name": "mgr.a",
          "os_id": "rhel",
          "os_name": "Red Hat Enterprise Linux",
          "os_version": "9.2 (Plow)",
          "os_version_id": "9.2",
          "process_name": "ceph-mgr",
          "stack_sig": "4cb0911c06087a31d9752535de90ba18fd7aab25c037945b2c61f584dcf6a6db",
          "timestamp": "2023-11-15T08:10:44.427601Z",
          "utsname_hostname": "rook-ceph-mgr-a-5d475468dd-wzhmt",
          "utsname_machine": "x86_64",
          "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
          "utsname_sysname": "Linux",
          "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
      }
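
      (For reference only, not a fix for the underlying decode issue: once the crash has been
      triaged, the "recently crashed" warning can normally be cleared by archiving the crash
      entry from the toolbox. A minimal sketch using the crash ID above:)

      sh-5.1$ ceph crash archive 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
      sh-5.1$ ceph crash archive-all   # or archive every crash still marked NEW
      sh-5.1$ ceph status              # HEALTH_WARN clears once no new crashes remain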

      Discussed here:
      https://chat.google.com/room/AAAAREGEba8/fZvCCW1MQfU

      Venky pointed out that it smells like this issue:
      https://tracker.ceph.com/issues/63188
      BZ:
      https://bugzilla.redhat.com/show_bug.cgi?id=2247174

      Venky cloned the 7.0 BZ to 6.1z4 target - https://bugzilla.redhat.com/show_bug.cgi?id=2249814
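
      (To check whether any listed crash matches that tracker's SnapRealmInfo::decode signature,
      the backtraces can be scanned from the toolbox; a sketch, relying on the "ID ENTITY NEW"
      header format shown by ceph crash ls above:)

      sh-5.1$ ceph crash ls | awk 'NR>1 {print $1}' | while read -r id; do
                ceph crash info "$id" | grep -q 'SnapRealmInfo::decode' && echo "$id matches"
              done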

      Version of all relevant components (if applicable):
      ODF 4.13.5-6

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Can this issue be reproduced?
      Trying to reproduce here:
      https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/32/

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Install ODF 4.12 and OCP 4.12
      2. Upgrade OCP to 4.13
      3. Upgrade ODF to 4.13.5-6 build
      4. After some time, the Ceph health warning appears (see the check sketch after this list)
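
      (A quick way to confirm the upgraded versions and then check health, assuming the default
      openshift-storage namespace; the CSV name below is illustrative:)

      $ oc get clusterversion            # OCP level, expected 4.13.x after step 2
      $ oc get csv -n openshift-storage  # ODF operator CSVs, expected odf-operator.v4.13.5-x or similar
      sh-5.1$ ceph health detail         # from the toolbox; reports "1 daemons have recently crashed"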

      Actual results:
      Ceph reports HEALTH_WARN "1 daemons have recently crashed" after the upgrade.

      Expected results:
      No health warning after the upgrade.

      Additional info:
      Must gather:
      http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-031vue1cslv33-uba/j-031vue1cslv33-uba_20231115T053551/logs/testcases_1700036781/j-031vue1cslv33-u/
      Job:
      https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/31/

              Venky Shankar (vshankar@redhat.com)
              Sunil Kumar Heggodu Gopala Acharya (sheggodu@redhat.com)
              Elad Ben Aharon