Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: odf-4.16
Component/s: ceph-csi-operator, Multi-Cloud Object Gateway, ocs-client-operator, ocs-operator, odf-operator, rook
Labels:
None

Blocked:
False
Blocked Reason:

None
Ready:
False
Bugzilla Bug:
RHBZ: 2293062
Dev Approval:
?
Prod build version:
4.19.0-41.konflux
QE Approval:
?
Release Note Type:
If docs needed, set a value
Target Release:

odf-4.21
Intelligence Requested:
Market:

Release Blocker:
Proposed

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem (please be as detailed as possible and provide log
snippets):

The following storage-related processes currently share the same priority as the virt-launcher pods, making them susceptible to eviction or termination during pod evacuation or Out Of Memory (OOM) events:

csi-addons-controller-manager
noobaa-operator
ocs-metrics-exporter
ocs-operator
odf-console
odf-operator-controller
rook-ceph-crashcollector
rook-ceph-exporter
rook-ceph-operator
rook-ceph-tools
ux-backend-server

Given the important role that some of these processes play in the storage system, it is worth considering elevating their priority class.
Enhancing their priority would improve the stability and robustness of the storage system during periods of stress, ensure continued operation during critical scenarios, and facilitate system debugging and information gathering in the event of crashes.
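
As an illustration of the requested change, the sketch below sets priorityClassName on the pod template of a Deployment manifest. The helper name with_priority_class and the trimmed-down manifest are hypothetical and only for illustration. system-cluster-critical is used because it is the class named in the linked ocs-operator PRs; which class each component should actually get is left to the fix.

```python
import copy


def with_priority_class(deployment, priority_class_name):
    """Return a copy of a Deployment manifest with priorityClassName set.

    The field lives on the pod template: spec.template.spec.priorityClassName.
    The input manifest is not modified.
    """
    patched = copy.deepcopy(deployment)
    pod_spec = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    pod_spec["priorityClassName"] = priority_class_name
    return patched


# Hypothetical, heavily trimmed manifest for one of the affected deployments.
example = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "rook-ceph-operator"},
    "spec": {
        "template": {"spec": {"containers": [{"name": "rook-ceph-operator"}]}}
    },
}

patched = with_priority_class(example, "system-cluster-critical")
print(patched["spec"]["template"]["spec"]["priorityClassName"])
```

In the operators themselves the same field would be set in the CSV deployment specs rather than patched at runtime, as the linked PR titles suggest.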

Version of all relevant components (if applicable):
All

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Loss of these pods in a stress scenario may impact the ability to gather storage information or monitor the system effectively.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue reproduce from the UI?
yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF 4.15
2. Stress the system to the point of node evacuation with a high number of VMs

Actual results:
These pods are OOM-killed or evacuated.

Expected results:
All important storage pods, including those that are important for debugging and monitoring, should not be evacuated/killed before all VMs are.
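
One way to check the expected result is to compare the resolved priority of each storage pod with the priority of the VM pods. The sketch below does this over pod objects shaped like the items returned by "oc get pods -o json". The function name pods_at_risk, the sample pods, and the VM_PRIORITY value are hypothetical. Treating a missing spec.priority as 0 is also an assumption made for the sample data. Priority is only one of the inputs to eviction and OOM decisions, so this is a rough check, not a guarantee.

```python
# Value of the built-in Kubernetes system-cluster-critical priority class.
SYSTEM_CLUSTER_CRITICAL = 2_000_000_000


def pods_at_risk(pods, vm_priority):
    """Return sorted names of pods whose priority is not strictly above vm_priority."""
    at_risk = []
    for pod in pods:
        # Assumption: a pod without a resolved spec.priority is treated as 0.
        priority = pod.get("spec", {}).get("priority", 0)
        if priority <= vm_priority:
            at_risk.append(pod["metadata"]["name"])
    return sorted(at_risk)


# Hypothetical priority for the VM (virt-launcher) pods; read the real value
# from the cluster rather than relying on this number.
VM_PRIORITY = 1_000_000

# Hypothetical sample data for three of the pods listed in this report.
sample_pods = [
    {"metadata": {"name": "rook-ceph-operator"},
     "spec": {"priority": SYSTEM_CLUSTER_CRITICAL}},
    {"metadata": {"name": "ocs-metrics-exporter"},
     "spec": {"priority": VM_PRIORITY}},
    {"metadata": {"name": "rook-ceph-tools"}, "spec": {}},
]

print(pods_at_risk(sample_pods, VM_PRIORITY))
```

An empty list from this check would mean every listed pod has a strictly higher priority than the VM pods.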

Additional info:

links to

red-hat-storage/ceph-csi-operator#101: bundle: add priorityClassName to csv deployments

red-hat-storage/ceph-csi-operator#143: DFBUGS-394:[release-4.20] bundle: add priorityClassName to csv deployments

red-hat-storage/ocs-client-operator#344: bundle: add priorityClassName to csv deployments

red-hat-storage/ocs-client-operator#457: DFBUGS-394:[release-4.20] bundle: add priorityClassName to csv deployments

red-hat-storage/ocs-client-operator#465: bundle: update priorityClassName in csv deployments

red-hat-storage/ocs-client-operator#466: DFBUGS-394:[release-4.20] bundle: update priorityClassName in csv deployments

red-hat-storage/ocs-operator#3151: add priorityClassName under deployments

red-hat-storage/ocs-operator#3158: DFBUGS-394:[release-4.19] add priorityClassName under deployments

red-hat-storage/ocs-operator#3465: promote all openshift-user-critical pods to system-cluster-critical

red-hat-storage/ocs-operator#3468: DFBUGS-394:[release-4.20] promote all openshift-user-critical pods to system-cluster-critical