- Bug
- Resolution: Unresolved
- Critical
- odf-4.12
- None
Description of problem (please be as detailed as possible and provide log snippets):
- We noticed that multiple OSDs were down in the Ceph cluster due to a known bug in ODF 4.12 [1]. This bug only occurs when the ODF cluster becomes full.
[1] https://access.redhat.com/solutions/6999849
[ANALYSIS]
- We took the following actions to bring the OSDs back up:
- We patched the OSD Deployments to remove the expand-bluefs initContainer on the OSDs that were down:
~~~
$ for dpl in 1 18 38 41; do oc patch deployment -n openshift-storage rook-ceph-osd-$dpl --type=json -p='[{"op": "remove", "path": "/spec/template/spec/initContainers/X"}]'; done
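# X above is the index of the expand-bluefs entry in the initContainers list.
# One way to find it is to list the init containers in order and count (an assumption, not captured in the case notes):
#   oc get deployment -n openshift-storage rook-ceph-osd-1 -o jsonpath='{.spec.template.spec.initContainers[*].name}'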
~~~
- We set the bluefs_shared_alloc_size value to 16384 on the OSDs that were down:
~~~
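# Assumption: this is run from the rook-ceph toolbox pod (e.g. oc rsh -n openshift-storage deploy/rook-ceph-tools);
# the exact entry point was not noted in the case.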
ceph config set osd.<id> bluefs_shared_alloc_size 16384
~~~
- We also scaled down the rook-ceph and ocs operators while this work was done (sketched below).
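For reference, a minimal sketch of the scale-down, assuming the default operator deployment names in the openshift-storage namespace (the exact commands used were not captured here):
~~~
$ oc scale deployment rook-ceph-operator -n openshift-storage --replicas=0
$ oc scale deployment ocs-operator -n openshift-storage --replicas=0
~~~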
- We also set the following label on all OSD deployments so the rook-ceph-operator wouldn't stomp on our `bluefs_shared_alloc_size` change when it gets scaled back up:
$ oc label deployment rook-ceph-osd-<osd_id> ceph.rook.io/do-not-reconcile=<osd_id> -n openshift-storage
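As an aside (not part of the recorded steps), the label can later be removed with the standard oc label removal syntax:
$ oc label deployment rook-ceph-osd-<osd_id> ceph.rook.io/do-not-reconcile- -n openshift-storage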
- All of the OSDs are now restarting frequently (but no longer crash-looping). The pods with the highest number of restarts all have the following error message:
"assert_condition": "bl.length() <= runway",
- So we set the following config settings on OSDs 18, 24, 36, 8, 42, 4, and 38 to hopefully stop the pods from restarting (shown below for osd.8; a sketch of applying them to all of these OSDs follows):
~~~
osd.8 advanced bluefs_max_log_runway 8388608
osd.8 advanced bluefs_min_log_runway 4194304
~~~
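A minimal sketch of applying these values to each of the listed OSDs from the toolbox (an assumption; the exact commands run were not recorded):
~~~
$ for id in 18 24 36 8 42 4 38; do ceph config set osd.$id bluefs_max_log_runway 8388608; ceph config set osd.$id bluefs_min_log_runway 4194304; done
~~~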
- This still leaves most of the pods in the cluster restarting periodically, but hopefully the ones that were restarting constantly will remain stable with these settings.
- This is all an effort to get the cluster stable prior to the customer upgrading OCP and ODF to 4.14 in the near future (Oct. 9 for 4.12 -> 4.13; Oct. 17 for 4.13 -> 4.14).
Version of all relevant components (if applicable):
ODF 4.12
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, all of the OSD pods keep restarting.
Is there any workaround available to the best of your knowledge?
Redeploy all of the OSDs.