Data Foundation Bugs / DFBUGS-283

[2314917] [GSS] OSD_TOO_MANY_REPAIRS: Too many repaired reads - OSDs Crashing "verify_csum bad crc checksum"


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Versions: odf-4.18, odf-4.12
    • Component: ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log snippets):

      Below are the related bugs; this report will most likely be marked as a duplicate of one of them.

      https://bugzilla.redhat.com/show_bug.cgi?id=2314526 ← This case involved data loss from the beginning.

      https://bugzilla.redhat.com/show_bug.cgi?id=2294980 ← This was the original BZ we linked to, which also involved data loss.

      This case is seeing the “Too many repaired reads” warning on all OSDs, and each OSD has the “verify_csum bad crc32c/0x1000 checksum” error in its logs (diagnostic commands are sketched after the status output below).

      [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 8 OSDs
      osd.7 had 20 reads repaired
      osd.6 had 27 reads repaired
      osd.1 had 16 reads repaired
      osd.0 had 20 reads repaired
      osd.2 had 21 reads repaired
      osd.3 had 17 reads repaired
      osd.4 had 19 reads repaired
      osd.5 had 26 reads repaired

      health: HEALTH_WARN
      Too many repaired reads on 8 OSDs

      services:
      mon: 5 daemons, quorum a,c,f,g,i (age 3d)
      mgr: b(active, since 3d), standbys: a
      mds: 1/1 daemons up, 1 hot standby
      osd: 8 osds: 8 up (since 3d), 8 in (since 16M)
      rgw: 2 daemons active (2 hosts, 1 zones)

      data:
      volumes: 1/1 healthy
      pools: 12 pools, 305 pgs
      objects: 1.41M objects, 4.2 TiB
      usage: 16 TiB used, 31 TiB / 47 TiB avail
      pgs: 305 active+clean

      io:
      client: 70 KiB/s rd, 42 MiB/s wr, 34 op/s rd, 1.88k op/s wr
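
      For context, the OSD_TOO_MANY_REPAIRS warning fires once an OSD's repaired-read counter exceeds mon_osd_warn_num_repaired (default 10). A minimal diagnostic sketch, assuming shell access to the cluster via the rook-ceph toolbox pod and that these commands are available in this Pacific build; osd.7 is just one example taken from the output above:

      # Per-OSD repaired-read counts behind the warning
      ceph health detail

      # Threshold that triggers OSD_TOO_MANY_REPAIRS (default 10)
      ceph config get mon mon_osd_warn_num_repaired

      # Once the underlying cause is addressed, the counter can be
      # reset per OSD so the warning clears (shown for osd.7 only)
      ceph tell osd.7 clear_shards_repaired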

      2024-09-25T14:16:58.138234712Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:16:58.138823134Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:16:58.139469139Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:16:58.140165055Z debug 2024-09-25T14:16:58.126+0000 7fff9aaec6f0 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0x9801c8c8, device location [0x1ac7c5a2000~1000], logical extent 0x3d0000~1000, object #2:0500c813:::rbd_data.b8f1e1d32e1d85.000000000000acff:head#
      2024-09-25T14:17:20.386342330Z debug 2024-09-25T14:17:20.375+0000 7fffaa4dc6f0 4 rocksdb: [db_impl/db_impl_write.cc:1234] Flushing all column families with data in WAL number 656782. Total log size is 1073743416 while max_total_wal_size is 1073741824
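
      For reference, 0x6706be76 (the "got" value in every line above) is the crc32c of a 4 KiB block of zeros, which typically indicates the device returned zeroed data rather than randomly corrupted bytes. A minimal sketch for mapping the affected object back to its PG and re-checking it, assuming toolbox access; POOL_NAME and PGID are placeholders (the object key "#2:..." indicates pool id 2):

      # Resolve the pool name for pool id 2
      ceph osd lspools

      # Map the RBD object from the log to its PG and acting OSD set
      ceph osd map POOL_NAME rbd_data.b8f1e1d32e1d85.000000000000acff

      # Deep-scrub that PG and list any recorded inconsistencies
      ceph pg deep-scrub PGID
      rados list-inconsistent-obj PGID --format=json-pretty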

      Version of all relevant components (if applicable):

      OCP:
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.60   True        False         8d      Cluster version is 4.12.60

      ODF:

      NAME DISPLAY VERSION REPLACES PHASE
      amq-broker-operator.v7.11.7-opr-1-1726157430 Red Hat Integration - AMQ Broker for RHEL 8 (Multiarch) 7.11.7-opr-1-1726157430 amq-broker-operator.v7.11.6-opr-2 Succeeded
      compliance-operator.v1.5.1 Compliance Operator 1.5.1 compliance-operator.v1.4.0 Succeeded
      elasticsearch-operator.v5.8.12 OpenShift Elasticsearch Operator 5.8.12 elasticsearch-operator.v5.7.7 Succeeded
      mcg-operator.v4.12.14-rhodf NooBaa Operator 4.12.14-rhodf mcg-operator.v4.11.13 Succeeded
      ocs-operator.v4.12.14-rhodf OpenShift Container Storage 4.12.14-rhodf ocs-operator.v4.11.13 Succeeded
      odf-csi-addons-operator.v4.12.14-rhodf CSI Addons 4.12.14-rhodf odf-csi-addons-operator.v4.11.13 Succeeded
      odf-operator.v4.12.14-rhodf OpenShift Data Foundation 4.12.14-rhodf odf-operator.v4.11.13 Succeeded
      openshift-gitops-operator.v1.13.1 Red Hat OpenShift GitOps 1.13.1 openshift-gitops-operator.v1.8.6 Succeeded
      openshift-pipelines-operator-rh.v1.9.3 Red Hat OpenShift Pipelines 1.9.3 openshift-pipelines-operator-rh.v1.8.2 Succeeded

      Ceph:

      {
          "mon": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 5
          },
          "mgr": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 2
          },
          "osd": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 8
          },
          "mds": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 2
          },
          "rgw": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 2
          },
          "overall": {
              "ceph version 16.2.10-266.el8cp (07823b29a11c047cffc11d81c3c975986573a225) pacific (stable)": 19
          }
      }

      Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

      Given what we saw in the previous cases, we are extremely concerned about data loss.

      Is there any workaround available to the best of your knowledge?

      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      5

        Assignee: Sudeesh John
        Reporter: Craig Wayman
        Contributors: Miguel Duaso, Raimund Sacherer, Sudeesh John
        QA Contact: Elad Ben Aharon