Data Foundation Bugs / DFBUGS-524

[2317658] "assert_condition": "bl.length() <= runway" error


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.18
    • odf-4.12
    • ceph/RADOS/x86

      Description of problem (please be as detailed as possible and provide log snippets):

      [ANALYSIS]

      • We took the following actions to bring the osds back up:
      • We patched the OSD Deployments to remove the initContainer expand-bluefs on the osds that were down:
        ~~~
        $ for dpl in 1 18 38 41; do oc patch deployment -n openshift-storage rook-ceph-osd-$dpl --type=json -p='[{"op": "remove", "path": "/spec/template/spec/initContainers/X"}]'; done
        ~~~
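
        For reference, the index X in the patch path has to match the position of the expand-bluefs initContainer in each Deployment; one way to check it (assuming the standard rook-ceph-osd-<id> deployment names, osd 1 shown only as an example) is something like:
        ~~~
        # list the initContainers of one affected OSD deployment, in order
        $ oc get deployment rook-ceph-osd-1 -n openshift-storage \
            -o jsonpath='{range .spec.template.spec.initContainers[*]}{.name}{"\n"}{end}'
        ~~~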

      • We set the bluefs_shared_alloc_size value to 16384 on the osds that were down:
        ~~~
        ceph config set osd.id bluefs_shared_alloc_size 16384
        ~~~
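        To confirm the override took effect on a given OSD (osd.1 here is only an example id), something like:
        ~~~
        ceph config get osd.1 bluefs_shared_alloc_size
        ~~~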
      • We also scaled down the rook-ceph and ocs operators while this work was done
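        The scale-down was roughly along these lines (deployment names assumed from a standard ODF install, not taken from the case):
        ~~~
        $ oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
        $ oc scale deployment ocs-operator --replicas=0 -n openshift-storage
        ~~~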
      • We also set the following label on all osd deployments so the rook-ceph-operator wouldn't stomp on our `bluefs_shared_alloc_size` config setting when it gets scaled back up:
        $ oc label deployment rook-ceph-<osd_id> ceph.rook.io/do-not-reconcile=<osd_id> -n openshift-storage
      • All of the osds are now restarting frequently (but not crash-looping anymore). The pods with the highest number of restarts all have the following error message:

      "assert_condition": "bl.length() <= runway",

      • So we set the following config settings on osds 18, 24, 36, 8, 42, 4, and 38 to hopefully stop the pods from restarting:
        osd.8 advanced bluefs_max_log_runway 8388608
        osd.8 advanced bluefs_min_log_runway 4194304
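        These were presumably applied per OSD with something along the lines of (shown for osd.8):
        ~~~
        ceph config set osd.8 bluefs_max_log_runway 8388608
        ceph config set osd.8 bluefs_min_log_runway 4194304
        ~~~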
      • This still leaves us with most of the pods in the cluster restarting periodically, but hopefully the ones that were restarting constantly can remain stable with these settings.
      • This is all an effort to get the cluster stable prior to the customer upgrading OCP and ODF to 4.14 in the near future (Oct. 9 for 4.12 -> 4.13; Oct. 17 for 4.13 -> 4.14).

      Version of all relevant components (if applicable):
      ODF 4.12

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      Yes, all of the osd pods keep restarting.

      Is there any workaround available to the best of your knowledge?

      Redeploy all of the osds
