Type: Bug
Resolution: Won't Do
Priority: Critical
Description of problem (please be as detailed as possible and provide log
snippets):
- We noticed that some new crush rules get pushed out with the update to ODF 4.16 (a few commands to verify them are included after the list):
.mgr_host_ssd
ocs-storagecluster-cephblockpool_host_ssd
.rgw.root_host_ssd
ocs-storagecluster-cephobjectstore.rgw.otp_host_ssd
ocs-storagecluster-cephobjectstore.rgw.meta_host_ssd
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_host_ssd
ocs-storagecluster-cephobjectstore.rgw.buckets.index_host_ssd
ocs-storagecluster-cephobjectstore.rgw.control_host_ssd
ocs-storagecluster-cephobjectstore.rgw.log_host_ssd
ocs-storagecluster-cephobjectstore.rgw.buckets.data_host_ssd
ocs-storagecluster-cephfilesystem-metadata_host_ssd
ocs-storagecluster-cephfilesystem-data0_host_ssd
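If it helps, here is how we have been confirming the new rules from the rook-ceph toolbox pod; these are just standard ceph commands, and the pool/rule names below are simply the ones from this cluster:
# list all crush rules, including the new *_host_ssd ones
ceph osd crush rule ls
# show which rule a given pool currently uses
ceph osd pool get ocs-storagecluster-cephblockpool crush_rule
# dump one of the new rules to see the failure domain and device class it selects
ceph osd crush rule dump ocs-storagecluster-cephblockpool_host_ssd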
This has resulted in at least one case of a large rebalance after updating ODF (the commands we used to monitor it follow the status output):
  cluster:
    id:     c745f785-45cc-4c32-b62d-67fd61b87321
    health: HEALTH_WARN
            1 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 7 pgs backfill_toofull
            12 pool(s) nearfull
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,c,e (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 13h), 12 in (since 19M); 125 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 281 pgs
    objects: 8.31M objects, 8.9 TiB
    usage:   27 TiB used, 21 TiB / 48 TiB avail
    pgs:     12427371/24917634 objects misplaced (49.874%)
             156 active+clean
             116 active+remapped+backfill_wait
             7   active+remapped+backfill_wait+backfill_toofull
             2   active+remapped+backfilling
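For what it's worth, this is what we have been running to keep an eye on the backfill and the nearfull OSD (nothing specific to this case, just standard ceph commands):
# per-OSD utilization, to see which OSD is nearfull
ceph osd df tree
# overall progress of the misplaced objects / remapped PGs
ceph -s
# PGs currently stuck in backfill_toofull
ceph pg dump pgs_brief | grep backfill_toofull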
- Here are the rook-ceph-operator logs during the event (a quick way to verify the resulting rule assignments is sketched after the logs):
2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807746 I | cephclient: creating a new crush rule for changed failure domain ("host"-->"rack") on crush rule "replicated_rule"
2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807771 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "replicated_rule"
2024-10-16T02:37:58.842770528Z 2024-10-16 02:37:58.842722 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephblockpool_rack"
2024-10-16T02:40:58.441261067Z 2024-10-16 02:40:58.441216 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule ".rgw.root_rack"
2024-10-16T02:40:58.456827394Z 2024-10-16 02:40:58.456778 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.meta_rack"
2024-10-16T02:40:58.458467757Z 2024-10-16 02:40:58.458423 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.control_rack"
2024-10-16T02:40:58.464400294Z 2024-10-16 02:40:58.464360 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.otp_rack"
2024-10-16T02:40:58.470327637Z 2024-10-16 02:40:58.470228 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.index_rack"
2024-10-16T02:40:58.477802276Z 2024-10-16 02:40:58.477762 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.log_rack"
2024-10-16T02:40:58.484785130Z 2024-10-16 02:40:58.484742 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_rack"
2024-10-16T02:41:03.696760054Z 2024-10-16 02:41:03.696712 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.data_rack"
2024-10-16T02:43:30.629001666Z 2024-10-16 02:43:30.628941 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephfilesystem-metadata_rack"
2024-10-16T02:43:34.196044470Z 2024-10-16 02:43:34.195994 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephfilesystem-data0_rack"
2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807779 I | cephclient: crush rule "replicated_rule" will no longer be used by pool ".mgr"
2024-10-16T02:37:58.842770528Z 2024-10-16 02:37:58.842744 I | cephclient: crush rule "ocs-storagecluster-cephblockpool_rack" will no longer be used by pool "ocs-storagecluster-cephblockpool"
2024-10-16T02:40:58.441261067Z 2024-10-16 02:40:58.441238 I | cephclient: crush rule ".rgw.root_rack" will no longer be used by pool ".rgw.root"
2024-10-16T02:40:58.456827394Z 2024-10-16 02:40:58.456810 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.meta_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.meta"
2024-10-16T02:40:58.458467757Z 2024-10-16 02:40:58.458446 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.control_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.control"
2024-10-16T02:40:58.464400294Z 2024-10-16 02:40:58.464385 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.otp_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.otp"
2024-10-16T02:40:58.470327637Z 2024-10-16 02:40:58.470254 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.index_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.index"
2024-10-16T02:40:58.477802276Z 2024-10-16 02:40:58.477785 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.log_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.log"
2024-10-16T02:40:58.484785130Z 2024-10-16 02:40:58.484769 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec"
2024-10-16T02:41:03.696760054Z 2024-10-16 02:41:03.696736 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.data_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.data"
2024-10-16T02:43:30.629001666Z 2024-10-16 02:43:30.628966 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-metadata_rack" will no longer be used by pool "ocs-storagecluster-cephfilesystem-metadata"
2024-10-16T02:43:34.196044470Z 2024-10-16 02:43:34.196026 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-data0_rack" will no longer be used by pool "ocs-storagecluster-cephfilesystem-data0"
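To double-check what these log lines describe (each pool moved off its *_rack rule onto the new *_host_ssd rule), we ran something like the following from the toolbox; the loop is just a convenience, not an official procedure:
# print each pool and the crush rule it now uses
for p in $(ceph osd pool ls); do
  echo -n "$p: "
  ceph osd pool get "$p" crush_rule
done
# the same information with numeric rule IDs, for cross-checking against 'ceph osd crush rule dump'
ceph osd pool ls detail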
Version of all relevant components (if applicable):
ODF 4.16
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, it temporarily degrades the ODF cluster's performance until backfilling completes.
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
I haven't been able to reproduce it in the lab yet, but have tried once and will be trying again.
Can this issue be reproduced from the UI?
No
If this is a regression, please provide more details to justify this:
No
Actual results:
Ceph has to go through a rebalance after upgrading to ODF 4.16
Expected results:
Ceph doesn't have to go through a rebalance after upgrading to ODF 4.16
Additional info:
It would be very helpful if engineering could shed some insight on a) whether the rebalance is expected behaviour, and b) if it is expected, what can be done to mitigate this interruption. This is my main question/concern.
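In case it helps the discussion, the only mitigation we could come up with ourselves (unverified, and we would want engineering to confirm whether it is safe/supported while the rook operator is reconciling) is to pause data movement around the upgrade with the standard OSD flags and then let backfill proceed in a throttled way once the new rules are in place:
# before/during the upgrade window (may need to be re-applied if the operator clears them)
ceph osd set norebalance
ceph osd set nobackfill
# after the upgrade, review 'ceph -s' and the new crush rules, then re-enable movement,
# optionally throttling backfill first
ceph config set osd osd_max_backfills 1
ceph osd unset nobackfill
ceph osd unset norebalance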
Relevant attachments in supportshell:
[bmcmurra@supportshell-1 03961471]$ ll
total 132
drwxrwxrwx+ 3 yank yank 54 Oct 16 16:02 0010-inspect-lbs-i2l.tar.gz
drwxrwxrwx+ 3 yank yank 55 Oct 16 16:02 0020-inspect-openshift-storage.tar.gz
-rw-rw-rw-+ 1 yank yank 112257 Oct 16 17:11 0030-image.png
drwxrwxrwx+ 3 yank yank 59 Oct 16 17:24 0040-must-gather-openshift-logging.tar.gz
drwxrwxrwx+ 3 yank yank 59 Oct 17 18:22 0050-must-gather-openshift-storage.tar.gz
Let me know if you require any more data than what's already in supportshell.
Thanks
Brandon McMurray
Technical Support Engineer, RHCE
Software Defined Storage and OpenShift Data Foundation