Data Foundation Bugs · DFBUGS-791

ceph-toolbox fails to launch due to new affinity rule

Bug · Resolution: Unresolved · odf-4.18 · ceph
      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      The problem is a new affinity rule introduced in ODF 4.16. ODF 4.15 was fine; this problem only occurred after the upgrade.

      In summary, the new nodeAffinity introduced in ODF 4.16 has created a bug that stops the ceph tools pod from running. We have proved this by manually editing the deployment config for ceph tools (temporarily, as the operator puts it all back), which creates a second replica set. Removing the new nodeAffinity rules from ceph tools meant it could run on an application node. This is consistent with the ceph tools behaviour in ODF 4.15.

      There is a preferred way to run the ceph CLI: the ceph tools pod.

      In [1] there is a workaround that uses the operator pod as the ceph tools pod. It involves extra manual steps, which are fine, but the preference is not to have to do these when it is possible to avoid them with the ceph tools pod. The key sentence from [1] is "The preferred method to accomplish this task is detailed in KCS article #4628891". This is article [2].
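      For context, the operator-pod workaround from [1] looks roughly like the following (a sketch based on the article; the pod label and config path should be verified against [1] itself):

      # Open a shell in the rook-ceph operator pod (workaround from [1]).
      oc rsh -n openshift-storage \
        $(oc get pods -n openshift-storage -l app=rook-ceph-operator -o name)

      # Inside the pod, point the ceph CLI at the cluster config
      # (path as documented for internal-mode deployments; verify against [1]).
      export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
      ceph status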

      In article [2], the answer for running ceph tools on ODF 4.15 and above no longer works, due to the new nodeAffinity introduced in ODF 4.16. The answer from [2] is shown below >>>

      >>> ODF v4.15 and above: To enable the toolbox pod, patch/edit the StorageCluster CR like below:
      >>>
      >>> oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[
      >>>   { "op": "replace", "path": "/spec/enableCephTools", "value": true }
      >>> ]'

      The introduction of this nodeAffinity in ODF 4.16 has resulted in this bug. The new nodeAffinity is shown in the YAML below.

      spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: cluster.ocs.openshift.io/openshift-storage
                      operator: Exists
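      The effect of this rule can be illustrated with a small simulation of the scheduler's match logic. This is only a sketch; the node names and labels below are invented for illustration, but the "Exists" semantics match how requiredDuringSchedulingIgnoredDuringExecution filters candidate nodes.

      ```python
      # Sketch: a required nodeAffinity with operator "Exists" only admits
      # nodes that carry the label key, regardless of its value.

      STORAGE_LABEL = "cluster.ocs.openshift.io/openshift-storage"

      def matches_exists(node_labels: dict, key: str) -> bool:
          """Mimic the 'Exists' operator: the label key must be present."""
          return key in node_labels

      # Invented example nodes: only the storage/infra node carries the label.
      nodes = {
          "storage-0": {STORAGE_LABEL: ""},
          "app-0": {},   # application node, no storage label
          "app-1": {},
      }

      schedulable = [name for name, labels in nodes.items()
                     if matches_exists(labels, STORAGE_LABEL)]
      print(schedulable)  # -> ['storage-0']
      ```

      This is why, after the upgrade, the toolbox pod can no longer land on an application node: the required rule excludes every node without the openshift-storage label.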

      ceph tools is required for difficult ceph support cases and, more importantly, for things like node maintenance; see [3] below. [3] refers to [1], which in turn refers to [2], which means ceph tools is required. We have upcoming firmware updates on all nodes and will need to reboot the infra (ceph) nodes. This must be done gracefully to avoid corruption or data loss in ceph, and checking/verifying ODF status requires ceph tools.
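      As background, the graceful reboot procedure depends on exactly these toolbox commands (a sketch only; see [3] for the authoritative steps):

      # Before rebooting a storage node, stop Ceph from marking OSDs out:
      ceph osd set noout

      # ... reboot the node and wait for it to rejoin the cluster ...

      # Afterwards, re-enable normal behaviour and confirm health:
      ceph osd unset noout
      ceph status   # should report HEALTH_OK before touching the next node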

      Can we please have a more elegant workaround than manually editing the deployment config and having two pods? Is it possible to fix this via the operator?
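      For reference, the temporary manual edit described above can be expressed as a single JSON patch (assuming the toolbox deployment is named rook-ceph-tools; as noted, the operator reconciles this change away, so it is only a stop-gap):

      oc patch deployment rook-ceph-tools -n openshift-storage --type json \
        --patch '[{ "op": "remove", "path": "/spec/template/spec/affinity/nodeAffinity" }]'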

      [1] https://access.redhat.com/articles/4870821 # Accessing the Red Hat Ceph Storage CLI in OpenShift Data Foundation 4.x
      [2] https://access.redhat.com/articles/4628891 # Configuring the Rook-Ceph Toolbox in OpenShift Data Foundation 4.x
      [3] https://access.redhat.com/solutions/6495631 # How to safely reboot an OCS/ODF 4 node

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      IPI Baremetal

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal running on OCP

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      OCP 4.16.17

      ODF 4.16.16

       

      Does this issue impact your ability to continue to work with the product?

      Not directly, no, but it does cause problems with some management functions.

      Is there any workaround available to the best of your knowledge?

      Yes, but it is not ideal (see details above).

       

      Can this issue be reproduced? If so, please provide the hit rate

      Unsure as yet

       

      Can this issue be reproduced from the UI?

      No

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      1.

      2.

      3.

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

       

       

      Expected results:

       

      Logs collected and log location:

       

      Additional info:

       

              rh-ee-mrudraia Marulasiddaiah Rudraiah
              rhn-support-andbartl Andy Bartlett
              Nagendra Reddy Nagendra Reddy