  1. OpenShift Logging
  2. LOG-4185

Resources, tolerations and nodeSelector for the collector are missing

Details

    • Bug
    • Resolution: Done
    • Critical
    • Logging 5.7.2
    • Logging 5.6.z, Logging 5.7.z, Logging 5.5.z
    • Log Collection
    • False
    • None
    • False
    • NEW
    • VERIFIED
    • A fix prior to this change to remove defaulting of the collection.type resulted in the operator no longer honoring the deprecated spec for resources, nodeSelector, and tolerations. This modifies the operator behavior to always prefer the "collection.logs" spec over those of "collection". Note this varies from previous behavior, which allowed using both the preferred fields and the deprecated fields but would ignore the deprecated fields when "collection.type" is populated.
    • Bug Fix
    • Proposed
    • Log Collection - Sprint 237
    • Critical

    Description

      Description of problem:

      When users have defined spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector, and spec.collection.type is also present, the values under spec.collection.logs.fluentd are never considered. This can be caused by:

      • bug LOG-4086
      • the workaround applied for RHOL 5.5.3 in LOG-3049
      • any other reason for which spec.collection.type is defined while the actual collector definition remains under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector
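
      The condition described above can be checked on a live cluster. The following is a minimal sketch, assuming the ClusterLogging instance has the default name "instance" in the openshift-logging namespace; the function names cl_field and check_collection_conflict are hypothetical helpers, not part of any product tooling.

```shell
#!/bin/sh
# Sketch: report whether spec.collection.type is set while collector
# settings still live under spec.collection.logs.fluentd.
# Assumption: the instance is named "instance" in "openshift-logging".

# Read a single field of the ClusterLogging instance via JSONPath.
# A missing field yields an empty string.
cl_field() {
  oc get clusterlogging instance -n openshift-logging -o jsonpath="$1"
}

# Print "affected" when both the new-style type and the old-style
# fluentd block are present, "ok" otherwise.
check_collection_conflict() {
  new_type=$(cl_field '{.spec.collection.type}')
  legacy=$(cl_field '{.spec.collection.logs.fluentd}')
  if [ -n "$new_type" ] && [ -n "$legacy" ]; then
    echo "affected"
  else
    echo "ok"
  fi
}
```

      Calling check_collection_conflict prints "affected" only when both definitions coexist, which is the combination this report describes as silently dropping the settings.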

      As an example, before bug LOG-4086, the collector resources, tolerations and nodeSelector were defined as documented:

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd

      The collector pods applied these definitions to the running collectors.

      With bug LOG-4086, as a consequence of the introduction of a new style that is not yet documented in the examples (tracked in bug OBSDOCS-79), a new line was introduced automatically for all users installing 5.7.0 as documented or upgrading from previous RHOL versions. The new line is the last one below:

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd
          type: vector  # <-- this line, which refers to the new style, was introduced and caused the collectors to crash loop

      To fix the crash looping of the collectors, a workaround was shared: change `type: vector` to `type: fluentd`, as follows:

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd
          type: fluentd  # <-- this

      However, the fix for bug LOG-4086 never deleted the line introduced by the new style of collector definition, and this has important consequences for all customers who:

      • installed 5.7.0 using the official documentation for defining the resources, tolerations and nodeSelector
      • upgraded from previous versions, passing through 5.7.0
      • applied the workaround for LOG-3049
      • hit any other reason or bug where spec.collection.type was set while the resources, nodeSelector or tolerations definitions remain under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector

      The consequence is that, because the old style and the new style are present together, none of the resources, tolerations and nodeSelector definitions are applied. This goes unnoticed until the end user detects:

      • collectors restarting because they lack resources, or reading/sending logs slowly
      • missing logs for entire nodes: the tolerations are no longer in the collector definition, so no collectors run on nodes that have custom taints
      • collector pods running on undesired nodes: some customers want to collect logs only from some nodes, not from all

       

      Version-Release number of selected component (if applicable):

      RHOL 5.5, 5.6 and 5.7.

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install RHOL in a version earlier than 5.7.0
      2. Define resources, tolerations and nodeSelector as documented

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd

      3. Verify that the daemonset and the fluentd pods have these definitions

      $ oc get ds collector -n openshift-logging -o yaml
      ...
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
      ...
            tolerations:
            - effect: NoSchedule
              key: node-role.kubernetes.io/master
              operator: Exists
            - effect: NoSchedule
              key: node.kubernetes.io/disk-pressure
              operator: Exists
            - effect: NoSchedule
              key: node-role.kubernetes.io/worker-hp
              operator: Exists

      4. Upgrade to RHOL 5.7.0, hitting LOG-4086, and then upgrade to RHOL 5.7.1. Alternatively, upgrade directly to a version later than 5.7.0 and edit the clusterlogging instance definition to simulate the behaviour introduced by LOG-4086 and/or the applied workaround. The simulation is needed even with the fix in place: the fix never deleted the new-style line for users who went through 5.7.0; it only avoids introducing the line for users upgrading from 5.6 to 5.7.1 or installing 5.7.1 directly.

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd
          type: fluentd   # <---- add this
      

      Confirm that the collectors no longer have the defined resources, nodeSelector and tolerations:

      $ oc get ds collector -n openshift-logging -o yaml
      ...
              resources:    # <-- the values are reset to the defaults, not those defined in the clusterlogging instance
                limits:
                  memory: 736Mi
                requests:
                  cpu: 100m
                  memory: 736Mi
      ...
            tolerations:       # <-- the values are reset to the defaults and do not contain those defined in the clusterlogging instance
            - effect: NoSchedule
              key: node-role.kubernetes.io/master
              operator: Exists
            - effect: NoSchedule
              key: node.kubernetes.io/disk-pressure
              operator: Exists
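
      The manual comparison above can also be scripted. This is a sketch only, assuming the instance name "instance", the openshift-logging namespace, and a container named "collector" inside the daemonset (the container name is an assumption and is not confirmed by this report); compare_memory_limit is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: compare the memory limit requested in the ClusterLogging
# instance with the limit actually present on the collector daemonset.
# Assumptions: instance "instance", namespace "openshift-logging",
# daemonset container named "collector".
compare_memory_limit() {
  wanted=$(oc get clusterlogging instance -n openshift-logging \
    -o jsonpath='{.spec.collection.logs.fluentd.resources.limits.memory}')
  actual=$(oc get ds collector -n openshift-logging \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="collector")].resources.limits.memory}')

  if [ -z "$wanted" ]; then
    echo "no memory limit set under spec.collection.logs.fluentd"
  elif [ "$wanted" = "$actual" ]; then
    echo "match: $actual"
  else
    echo "mismatch: wanted $wanted, daemonset has $actual"
  fi
}
```

      With the values shown in this report, the function would report a mismatch between the 500Mi defined in the instance and the 736Mi default on the daemonset.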
      

      Actual results:

      The resources, tolerations and nodeSelector are all missing from the collector definition, causing:

      • performance issues
      • collector pods running on nodes where they should not, which impacts the performance of the defined log stores and has an economic impact because more storage is needed
      • collector pods not running on tainted nodes, so logs from those nodes are never collected
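
      Because missing tolerations produce no alert, a periodic check may help detect the last symptom early. The sketch below assumes the daemonset is named collector in the openshift-logging namespace, as in the commands in this report; ds_tolerates is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check whether the collector daemonset carries a toleration
# for the given taint key. Prints "yes" or "no".
ds_tolerates() {
  # JSONPath [*].key returns all toleration keys separated by spaces.
  keys=$(oc get ds collector -n openshift-logging \
    -o jsonpath='{.spec.template.spec.tolerations[*].key}')
  case " $keys " in
    *" $1 "*) echo "yes" ;;
    *)        echo "no" ;;
  esac
}
```

      For example, `ds_tolerates node-role.kubernetes.io/worker-hp` printing "no" while the clusterlogging instance defines that toleration would indicate the problem described here.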

      Expected results:

      The definitions of resources, tolerations and nodeSelector in the clusterlogging instance are respected, as they were in the past and as the documentation states.

      Workaround:

      Delete the new line introduced by LOG-4086. Note, however, that the problem raises no alerts: customers only become aware of it through the symptoms described in the Actual results section, and by the time it is reported the consequences have already occurred. To apply the workaround, replace:

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd
          type: fluentd  # <---- this line needs to be deleted

      By:

      spec:
        collection:
          logs:
            fluentd:
              resources:
                limits:
                  memory: 500Mi
                requests:
                  cpu: 200m
                  memory: 500Mi
              tolerations:
              - effect: NoSchedule
                key: node-role.kubernetes.io/worker-hp
                operator: Exists
            type: fluentd
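
      The deletion can also be done without opening an editor. The following is a sketch, assuming the instance name "instance" and the openshift-logging namespace; remove_collection_type is a hypothetical helper name. Note that a JSON patch "remove" operation fails if the path does not exist, so the command errors out on instances that are not affected.

```shell
#!/bin/sh
# Sketch: remove spec.collection.type from the ClusterLogging instance
# using a JSON patch, leaving spec.collection.logs untouched.
# Assumptions: instance "instance", namespace "openshift-logging".
remove_collection_type() {
  oc patch clusterlogging instance -n openshift-logging --type=json \
    -p '[{"op": "remove", "path": "/spec/collection/type"}]'
}
```

      After running it, the daemonset can be inspected again with `oc get ds collector -n openshift-logging -o yaml` to confirm that the defined values are back.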
      

      Alternatively, move the collector definition to the new style described in the article: https://access.redhat.com/solutions/6999814
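
      As an illustration of what the new style might look like for the example used in this report, the sketch below prints a definition with the settings placed directly under spec.collection. The layout is an assumption inferred from the field names mentioned in this report and should be checked against the linked article; print_new_style_collection is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the example settings in the assumed new style, with
# type, resources and tolerations directly under spec.collection and
# no spec.collection.logs block.
print_new_style_collection() {
  cat <<'EOF'
spec:
  collection:
    type: fluentd
    resources:
      limits:
        memory: 500Mi
      requests:
        cpu: 200m
        memory: 500Mi
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/worker-hp
      operator: Exists
EOF
}
```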

          People

            jcantril@redhat.com Jeffrey Cantrill
            rhn-support-ocasalsa Oscar Casal Sanchez
            Qiaoling Tang Qiaoling Tang