DFBUGS-428 (Data Foundation Bugs)

[2273039] Storage nodes run out of capacity for Ceph OSDs when the `infra` node role is applied

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.16
    • odf-4.12
    • Documentation

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Storage nodes run out of capacity for Ceph OSDs when the `infra` node role is applied to them. Infra-related components are then scheduled on these nodes and take up additional CPU/memory resources.

      Version of all relevant components (if applicable):

      4.12+

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      The customer applies the `infra` node role to the storage worker nodes, following best practices, so that these nodes do not count towards their OCP subscription/entitlement, per [1].

      [1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.15/html-single/managing_and_allocating_storage_resources/index#how-to-use-dedicated-worker-nodes-for-openshift-data-foundation_rhodf

      Is there any workaround available to the best of your knowledge?

      None

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      2

      Is this issue reproducible?

      Always

      Can this issue be reproduced from the UI?

      Unsure

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      1. Configure ODF, a storage system, and projects that use the Ceph LUNs
      2. Set up node labels to use the `infra` role on the storage nodes
      3. Set up the default router pods to run only on `infra` nodes, which migrates the router pod workloads to the storage nodes
      4. Increase the traffic going to the router pods so that they take CPU/memory away from the other pods running the ODF storage components (Ceph OSDs, for example), causing ODF instability
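
For illustration only, step 3 can be sketched as a merge patch against the default IngressController. This is a minimal sketch, not the customer's actual configuration: the label key `node-role.kubernetes.io/infra` and the helper name `router_placement_patch` are assumptions that do not appear in this ticket.

```python
import json

# Assumption: the standard infra role label key; the ticket only says "`infra` role".
INFRA_LABEL = "node-role.kubernetes.io/infra"


def router_placement_patch(tolerations=None):
    """Build a merge patch that sets spec.nodePlacement on an IngressController.

    Hypothetical helper for illustration; it only builds a dict and does not
    talk to a cluster.
    """
    placement = {"nodeSelector": {"matchLabels": {INFRA_LABEL: ""}}}
    if tolerations:
        # Only include tolerations when the caller supplies some.
        placement["tolerations"] = tolerations
    return {"spec": {"nodePlacement": placement}}


patch = router_placement_patch()
print(json.dumps(patch, indent=2))
```

A patch of this shape would typically be applied with `oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge -p '<json>'`. Once the storage nodes carry the same label, the router pods become eligible to land on them, which is the situation described in step 4.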

      Actual results:

      Ceph becomes unstable, dropping OSDs since it does not have enough CPU/memory to

      Expected results:

      The documentation should include instructions on how to prevent additional `infra` workloads from running when this node role is applied to storage nodes.

      Additional info:

      A quick workaround would be to remove the `infra` node role from these storage nodes so that `infra`-related workloads do not run on them. Would this be a supported and documented option, as an alternative to having the customer set up taints/tolerations across all `infra`-related components?
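
To make the taints/tolerations alternative concrete, the sketch below models how a `NoSchedule` taint on a storage node would keep a pod without a matching toleration (a router pod, say) off that node while still admitting an ODF pod that tolerates it. It is a simplified model of the Kubernetes matching rules, not scheduler code. The taint `node.ocs.openshift.io/storage=true:NoSchedule` is an assumption taken from general ODF guidance rather than from this ticket, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any key
    operator: str = "Equal"  # Kubernetes default operator
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol, taint):
    """Return True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # "Equal" requires a non-empty key and an identical value.
    return tol.operator == "Equal" and bool(tol.key) and tol.value == taint.value


def can_schedule(tolerations, taints):
    """Return True if every scheduling-blocking taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


# Assumed taint for dedicated storage nodes (not quoted in this ticket).
storage_taint = Taint("node.ocs.openshift.io/storage", "true", "NoSchedule")

router_tolerations = []  # hypothetical router pod with no tolerations
osd_tolerations = [
    Toleration("node.ocs.openshift.io/storage", "Equal", "true", "NoSchedule")
]

print(can_schedule(router_tolerations, [storage_taint]))  # False
print(can_schedule(osd_tolerations, [storage_taint]))     # True
```

Under this model, tainting the storage nodes would keep non-tolerating `infra` workloads off them without removing the `infra` node role. Any `infra` component that already carries a broad `Exists` toleration would still be admitted, so each component's tolerations would need to be reviewed.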

              asriram@redhat.com Anjana Sriram
              rh-ee-syangsao Sam Yangsao
              Neha Berry Neha Berry