Data Foundation Bugs / DFBUGS-283

[2314917] [GSS] OSD_TOO_MANY_REPAIRS: Too many repaired reads - OSDs Crashing "verify_csum bad crc checksum"


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.12
    • Component: ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log snippets):

      Below are the related bugs; this report will most likely be marked as a duplicate of one of them.

      https://bugzilla.redhat.com/show_bug.cgi?id=2314526 ← This case involved data loss from the beginning.

      https://bugzilla.redhat.com/show_bug.cgi?id=2294980 ← This was the original BZ we linked to, which also involved data loss.

      This case is seeing the “Too many repaired reads” warning on all OSDs, and each OSD has the “verify_csum bad crc32c/0x1000 checksum” error in its logs (diagnostic commands are sketched after the status output below).

      [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 8 OSDs
      osd.7 had 20 reads repaired
      osd.6 had 27 reads repaired
      osd.1 had 16 reads repaired
      osd.0 had 20 reads repaired
      osd.2 had 21 reads repaired
      osd.3 had 17 reads repaired
      osd.4 had 19 reads repaired
      osd.5 had 26 reads repaired

      health: HEALTH_WARN
      Too many repaired reads on 8 OSDs

      services:
      mon: 5 daemons, quorum a,c,f,g,i (age 3d)
      mgr: b(active, since 3d), standbys: a
      mds: 1/1 daemons up, 1 hot standby
      osd: 8 osds: 8 up (since 3d), 8 in (since 16M)
      rgw: 2 daemons active (2 hosts, 1 zones)

      data:
      volumes: 1/1 healthy
      pools: 12 pools, 305 pgs
      objects: 1.41M objects, 4.2 TiB
      usage: 16 TiB used, 31 TiB / 47 TiB avail
      pgs: 305 active+clean

      io:
      client: 70 KiB/s rd, 42 MiB/s wr, 34 op/s rd, 1.88k op/s wr
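
      For context, the OSD_TOO_MANY_REPAIRS warning fires once an OSD's repaired-read counter exceeds mon_osd_warn_num_repaired (default 10). A minimal diagnostic sketch, assuming shell access to the cluster via the rook-ceph toolbox pod and that these commands are available in this Pacific build; osd.7 is just one example taken from the output above:

      # Per-OSD repaired-read counts behind the warning
      ceph health detail

      # Threshold that triggers OSD_TOO_MANY_REPAIRS (default 10)
      ceph config get mon mon_osd_warn_num_repaired

      # Once the underlying cause is addressed, the counter can be
      # reset per OSD so the warning clears (shown for osd.7 only)
      ceph tell osd.7 clear_shards_repaired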

      2024-09-25T14:16:58.138234712Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:16:58.138823134Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:16:58.139469139Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:16:58.140165055Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:17:20.386342330Z debug 2024-09-25T14:17:20.375+0000 7fffaa4dc6f0 4 rocksdb: [db_impl/db_impl_write.cc:1234] Flushing all column families with data in WAL number 656782. Total log size is 1073743416 while max_total_wal_size is 1073741824
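
      For reference, 0x6706be76 (the "got" value in every line above) is the crc32c of a 4 KiB block of zeros, which typically indicates the device returned zeroed data rather than randomly corrupted bytes. A minimal sketch for mapping the affected object back to its PG and re-checking it, assuming toolbox access; POOL_NAME and PGID are placeholders (the object key "#2:..." indicates pool id 2):

      # Resolve the pool name for pool id 2
      ceph osd lspools

      # Map the RBD object from the log to its PG and acting OSD set
      ceph osd map POOL_NAME rbd_data.b8f1e1d32e1d85.000000000000acff

      # Deep-scrub that PG and list any recorded inconsistencies
      ceph pg deep-scrub PGID
      rados list-inconsistent-obj PGID --format=json-pretty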

      Version of all relevant components (if applicable):

      OCP:
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.60   True        False         8d      Cluster version is 4.12.60

      ODF:

      NAME DISPLAY VERSION REPLACES PHASE
      amq-broker-operator.v7.11.7-opr-1-1726157430 Red Hat Integration - AMQ Broker for RHEL 8 (Multiarch) 7.11.7-opr-1-1726157430 amq-broker-operator.v7.11.6-opr-2 Succeeded
      compliance-operator.v1.5.1 Compliance Operator 1.5.1 compliance-operator.v1.4.0 Succeeded
      elasticsearch-operator.v5.8.12 OpenShift Elasticsearch Operator 5.8.12 elasticsearch-operator.v5.7.7 Succeeded
      mcg-operator.v4.12.14-rhodf NooBaa Operator 4.12.14-rhodf mcg-operator.v4.11.13 Succeeded
      ocs-operator.v4.12.14-rhodf OpenShift Container Storage 4.12.14-rhodf ocs-operator.v4.11.13 Succeeded
      odf-csi-addons-operator.v4.12.14-rhodf CSI Addons 4.12.14-rhodf odf-csi-addons-operator.v4.11.13 Succeeded
      odf-operator.v4.12.14-rhodf OpenShift Data Foundation 4.12.14-rhodf odf-operator.v4.11.13 Succeeded
      openshift-gitops-operator.v1.13.1 Red Hat OpenShift GitOps 1.13.1 openshift-gitops-operator.v1.8.6 Succeeded
      openshift-pipelines-operator-rh.v1.9.3 Red Hat OpenShift Pipelines 1.9.3 openshift-pipelines-operator-rh.v1.8.2 Succeeded

      Ceph:

      {
          "mon": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 5
          },
          "mgr": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 2
          },
          "osd": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 8
          },
          "mds": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 2
          },
          "rgw": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 2
          },
          "overall": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 19
          }
      }

      Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

      Given what we saw in the previous cases, we are extremely concerned about data loss.

      Is there any workaround available to the best of your knowledge?

      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      5

        Assignee: Sudeesh John
        Reporter: Craig Wayman
        Contributors: Miguel Duaso, Raimund Sacherer, Sudeesh John
        QA Contact: Elad Ben Aharon