Data Foundation Bugs · DFBUGS-791

ceph-toolbox fails to launch due to new affinity rule

Bug · Resolution: Unresolved · odf-4.18 · ceph
      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      The problem is a new affinity rule introduced in ODF 4.16. ODF 4.15 was fine; this problem only occurred after the upgrade.

      In summary, the new nodeAffinity introduced in ODF 4.16 has created a bug that stops the ceph tools pod from running. We have proved this by manually editing the deployment config for ceph tools (temporarily, as the operator puts it all back), which creates a second replica set. Removing the new nodeAffinity rules from ceph tools meant it could run on an application node. This is consistent with the ceph tools behaviour in ODF 4.15.

      There is a preferred way to run the ceph CLI: the ceph tools pod.

      In [1] there is a workaround that uses the operator pod as the ceph tools pod. It involves extra manual steps, which are fine, but the preference is not to have to do these when it is possible to avoid them with the ceph tools pod. The key sentence from [1] is "The preferred method to accomplish this task is detailed in KCS article #4628891". This is article [2].
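      For context, the operator-pod workaround from [1] looks roughly like the following (a sketch based on the article; the pod label and config path should be verified against [1] itself):

      # Open a shell in the rook-ceph operator pod (workaround from [1]).
      oc rsh -n openshift-storage \
        $(oc get pods -n openshift-storage -l app=rook-ceph-operator -o name)

      # Inside the pod, point the ceph CLI at the cluster config
      # (path as documented for internal-mode deployments; verify against [1]).
      export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
      ceph status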

      In article [2], the answer for running ceph tools on ODF 4.15 and above no longer works, due to the new nodeAffinity introduced in ODF 4.16. The answer from [2] is shown below >>>

      >>> ODF v4.15 and above: To enable the toolbox pod, patch/edit the StorageCluster CR like below:
      >>>
      >>> oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[
      >>>   { "op": "replace", "path": "/spec/enableCephTools", "value": true }
      >>> ]'

      The introduction of this nodeAffinity in ODF 4.16 has resulted in this bug. The new nodeAffinity is shown in the YAML below.

      spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: cluster.ocs.openshift.io/openshift-storage
                      operator: Exists
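      The effect of this rule can be illustrated with a small simulation of the scheduler's match logic. This is only a sketch; the node names and labels below are invented for illustration, but the "Exists" semantics match how requiredDuringSchedulingIgnoredDuringExecution filters candidate nodes.

      ```python
      # Sketch: a required nodeAffinity with operator "Exists" only admits
      # nodes that carry the label key, regardless of its value.

      STORAGE_LABEL = "cluster.ocs.openshift.io/openshift-storage"

      def matches_exists(node_labels: dict, key: str) -> bool:
          """Mimic the 'Exists' operator: the label key must be present."""
          return key in node_labels

      # Invented example nodes: only the storage/infra node carries the label.
      nodes = {
          "storage-0": {STORAGE_LABEL: ""},
          "app-0": {},   # application node, no storage label
          "app-1": {},
      }

      schedulable = [name for name, labels in nodes.items()
                     if matches_exists(labels, STORAGE_LABEL)]
      print(schedulable)  # -> ['storage-0']
      ```

      This is why, after the upgrade, the toolbox pod can no longer land on an application node: the required rule excludes every node without the openshift-storage label.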

      ceph tools is required for difficult ceph support cases and, more importantly, for things like node maintenance; see [3] below. [3] refers to [1], which in turn refers to [2], which means ceph tools is required. We have upcoming firmware updates on all nodes and will need to reboot the infra (ceph) nodes. This must be done gracefully to avoid corruption or data loss in ceph, and checking/verifying ODF status requires ceph tools.
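      As background, the graceful reboot procedure depends on exactly these toolbox commands (a sketch only; see [3] for the authoritative steps):

      # Before rebooting a storage node, stop Ceph from marking OSDs out:
      ceph osd set noout

      # ... reboot the node and wait for it to rejoin the cluster ...

      # Afterwards, re-enable normal behaviour and confirm health:
      ceph osd unset noout
      ceph status   # should report HEALTH_OK before touching the next node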

      Can we please have a more elegant workaround than manually editing the deployment config and having two pods? Is it possible to fix this via the operator?
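      For reference, the temporary manual edit described above can be expressed as a single JSON patch (assuming the toolbox deployment is named rook-ceph-tools; as noted, the operator reconciles this change away, so it is only a stop-gap):

      oc patch deployment rook-ceph-tools -n openshift-storage --type json \
        --patch '[{ "op": "remove", "path": "/spec/template/spec/affinity/nodeAffinity" }]'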

      [1] https://access.redhat.com/articles/4870821 # Accessing the Red Hat Ceph Storage CLI in OpenShift Data Foundation 4.x
      [2] https://access.redhat.com/articles/4628891 # Configuring the Rook-Ceph Toolbox in OpenShift Data Foundation 4.x
      [3] https://access.redhat.com/solutions/6495631 # How to safely reboot an OCS/ODF 4 node

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      IPI Baremetal

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal running on OCP

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      OCP 4.16.17

      ODF 4.16.16

       

      Does this issue impact your ability to continue to work with the product?

      Not directly, no, but it does cause problems with some management functions.

      Is there any workaround available to the best of your knowledge?

      Yes, but it is not ideal (see details above).

       

      Can this issue be reproduced? If so, please provide the hit rate

      Unsure as yet

       

      Can this issue be reproduced from the UI?

      No

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      1.

      2.

      3.

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

       

       

      Expected results:

       

      Logs collected and log location:

       

      Additional info:

       

              rh-ee-mrudraia Marulasiddaiah Rudraiah
              rhn-support-andbartl Andy Bartlett
              Nagendra Reddy Nagendra Reddy