Data Foundation Bugs / DFBUGS-394

[2293062] Storage pods have low priority making them vulnerable during high stress


    • Bug
    • Resolution: Unresolved
    • Critical
    • odf-4.18
    • odf-4.16
    • odf-operator
    • Proposed

      Description of problem (please be as detailed as possible and provide log
      snippets):

      The following storage-related processes currently share the same priority as the virt-launcher pods, making them susceptible to eviction or termination during pod evacuation or Out Of Memory (OOM) events:

      csi-addons-controller-manager
      noobaa-operator
      ocs-metrics-exporter
      ocs-operator
      odf-console
      odf-operator-controller
      rook-ceph-crashcollector
      rook-ceph-exporter
      rook-ceph-operator
      rook-ceph-tools
      ux-backend-server

      Given the important role that some of these processes play in the storage system, it is worth considering elevating their priority class.
      Enhancing their priority would improve the stability and robustness of the storage system during periods of stress, ensure continued operation during critical scenarios, and facilitate system debugging and information gathering in the event of crashes.
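
One way to see which storage pods are exposed is to compare each pod's resolved `spec.priority` with that of the virt-launcher pods. The following is a minimal Python sketch and not part of the original report: it assumes the pod list was exported with something like `oc get pods -A -o json`, the function name `pods_not_above` and the sample data are hypothetical, and the name prefixes are simply the pod names listed above.

```python
import json

# Name prefixes taken from the pod list in this report.
STORAGE_PREFIXES = (
    "csi-addons-controller-manager",
    "noobaa-operator",
    "ocs-metrics-exporter",
    "ocs-operator",
    "odf-console",
    "odf-operator-controller",
    "rook-ceph-crashcollector",
    "rook-ceph-exporter",
    "rook-ceph-operator",
    "rook-ceph-tools",
    "ux-backend-server",
)


def pods_not_above(pod_list, reference_priority, prefixes=STORAGE_PREFIXES):
    """Return (name, priorityClassName, priority) for each matching pod whose
    priority is not strictly higher than reference_priority."""
    flagged = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        if not name.startswith(tuple(prefixes)):
            continue
        spec = pod.get("spec", {})
        # Assumption: a missing spec.priority is treated as 0.
        priority = spec.get("priority", 0)
        if priority <= reference_priority:
            flagged.append((name, spec.get("priorityClassName"), priority))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample shaped like a Kubernetes PodList; values are illustrative only.
    sample = json.loads("""
    {"items": [
      {"metadata": {"name": "rook-ceph-operator-abc"}, "spec": {"priority": 0}},
      {"metadata": {"name": "virt-launcher-vm1-xyz"}, "spec": {"priority": 0}}
    ]}
    """)
    # reference_priority stands in for the priority observed on virt-launcher pods.
    for name, cls, prio in pods_not_above(sample, reference_priority=0):
        print(f"{name}: priorityClassName={cls} priority={prio}")
```

Any pod this reports has no scheduling or eviction advantage over the VM pods, which is the condition this bug describes.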

      Version of all relevant components (if applicable):
      All

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Loss of these pods in a stress scenario may impact the ability to gather storage information or monitor the system effectively.

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      1

      Is this issue reproducible?
      yes

      Can this issue be reproduced from the UI?
      yes

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Install ODF 4.15
      2. Stress the system to the point of node evacuation with a high number of VMs

      Actual results:
      These pods are OOM-killed or evacuated
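
To confirm this outcome after a stress run, the pod statuses can be scanned for the usual markers. The sketch below is an illustration rather than the reporter's procedure: it assumes a pod list exported as JSON (for example with `oc get pods -A -o json`), relies on the standard `status.reason` value `Evicted` and the container termination reason `OOMKilled`, and uses hypothetical function names and sample data. Pods that were already deleted during a node drain will not appear in such a list.

```python
def disruption_reasons(pod):
    """Return a sorted list of disruption markers found in one pod's status."""
    reasons = set()
    status = pod.get("status", {})
    # Pods evicted by the kubelet under node pressure carry reason "Evicted".
    if status.get("reason") == "Evicted":
        reasons.add("Evicted")
    for cs in status.get("containerStatuses", []):
        # Check both the current and the previous container state.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                reasons.add("OOMKilled")
    return sorted(reasons)


def disrupted_pods(pod_list):
    """Map pod name to its disruption markers, skipping unaffected pods."""
    result = {}
    for pod in pod_list.get("items", []):
        reasons = disruption_reasons(pod)
        if reasons:
            result[pod["metadata"]["name"]] = reasons
    return result


if __name__ == "__main__":
    # Hypothetical sample shaped like a Kubernetes PodList; values are illustrative only.
    sample = {"items": [
        {"metadata": {"name": "ocs-operator-1"},
         "status": {"phase": "Failed", "reason": "Evicted"}},
        {"metadata": {"name": "rook-ceph-exporter-1"},
         "status": {"containerStatuses": [
             {"state": {"running": {}},
              "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
    ]}
    for name, reasons in disrupted_pods(sample).items():
        print(name, ", ".join(reasons))
```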

      Expected results:
      All important storage pods, including those needed for debugging and monitoring, should not be evacuated or killed before all VMs are.

      Additional info:

              nigoyal Nitin Goyal
              guchen11 Guy Chen
              Elad Ben Aharon Elad Ben Aharon
              Votes: 0
              Watchers: 9