- Bug
- Resolution: Unresolved
- Critical
- odf-4.12
- None
Description of problem (please be as detailed as possible and provide log snippets):
- We noticed that multiple OSDs were down in the Ceph cluster due to a known bug in ODF 4.12 [1]. This bug only occurs when the ODF cluster becomes full.
[1] https://access.redhat.com/solutions/6999849
[ANALYSIS]
- We took the following actions to bring the OSDs back up:
- We patched the OSD Deployments to remove the expand-bluefs initContainer on the OSDs that were down:
~~~
$ for dpl in 1 18 38 41; do oc patch deployment -n openshift-storage rook-ceph-osd-$dpl --type=json -p='[{"op": "remove", "path": "/spec/template/spec/initContainers/X"}]'; done
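# X above is the index of the expand-bluefs entry in the initContainers list.
# One way to find it is to list the init containers in order and count (an assumption, not captured in the case notes):
#   oc get deployment -n openshift-storage rook-ceph-osd-1 -o jsonpath='{.spec.template.spec.initContainers[*].name}'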
~~~
- We set the bluefs_shared_alloc_size value to 16384 on the OSDs that were down:
~~~
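# Assumption: this is run from the rook-ceph toolbox pod (e.g. oc rsh -n openshift-storage deploy/rook-ceph-tools);
# the exact entry point was not noted in the case.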
ceph config set osd.<id> bluefs_shared_alloc_size 16384
~~~
- We also scaled down the rook-ceph and ocs operators while this work was done (sketched below).
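For reference, a minimal sketch of the scale-down, assuming the default operator deployment names in the openshift-storage namespace (the exact commands used were not captured here):
~~~
$ oc scale deployment rook-ceph-operator -n openshift-storage --replicas=0
$ oc scale deployment ocs-operator -n openshift-storage --replicas=0
~~~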
- We also set the following label on all OSD deployments so the rook-ceph-operator wouldn't stomp on our `bluefs_shared_alloc_size` change when it gets scaled back up:
$ oc label deployment rook-ceph-osd-<osd_id> ceph.rook.io/do-not-reconcile=<osd_id> -n openshift-storage
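As an aside (not part of the recorded steps), the label can later be removed with the standard oc label removal syntax:
$ oc label deployment rook-ceph-osd-<osd_id> ceph.rook.io/do-not-reconcile- -n openshift-storage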
- All of the OSDs are now restarting frequently (but no longer crash-looping). The pods with the highest number of restarts all have the following error message:
"assert_condition": "bl.length() <= runway",
- So we set the following config settings on OSDs 18, 24, 36, 8, 42, 4, and 38 to hopefully stop the pods from restarting (shown below for osd.8; a sketch of applying them to all of these OSDs follows):
~~~
osd.8 advanced bluefs_max_log_runway 8388608
osd.8 advanced bluefs_min_log_runway 4194304
~~~
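A minimal sketch of applying these values to each of the listed OSDs from the toolbox (an assumption; the exact commands run were not recorded):
~~~
$ for id in 18 24 36 8 42 4 38; do ceph config set osd.$id bluefs_max_log_runway 8388608; ceph config set osd.$id bluefs_min_log_runway 4194304; done
~~~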
- This still leaves most of the pods in the cluster restarting periodically, but hopefully the ones that were restarting constantly will remain stable with these settings.
- This is all an effort to get the cluster stable prior to the customer upgrading OCP and ODF to 4.14 in the near future (Oct. 9 for 4.12 -> 4.13; Oct. 17 for 4.13 -> 4.14).
Version of all relevant components (if applicable):
ODF 4.12
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, all of the OSD pods keep restarting.
Is there any workaround available to the best of your knowledge?
Redeploy all of the OSDs.