-
Bug
-
Resolution: Won't Do
-
Critical
-
None
-
Logging 5.6.z
-
False
-
None
-
False
-
NEW
-
NEW
-
-
Bug Fix
-
Proposed
-
-
-
Critical
Description of problem:
If a user has defined spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector and spec.collection.type is also present, then the values under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector are never considered. This can be caused by:
- bug LOG-4086
- the workaround applied for RHOL 5.5.3 in LOG-3049
- any other reason where spec.collection.type was set, but the real definition of the collector remains under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector
As an example, before bug LOG-4086, the collector resources, tolerations and nodeSelector were defined as documented:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
The collector pods were running with these definitions applied.
With bug LOG-4086, as a consequence of the introduction of a new style that was not yet documented in the examples (reflected in bug OBSDOCS-79), a new line was introduced automatically for all users installing a new 5.7.0 as documented or upgrading from previous RHOL versions. This line is:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: vector    <-- this line referring to the new style was introduced, causing the crashlooping of the collectors
To fix the crashlooping of the collectors, the workaround shared was to change `type: vector` to `type: fluentd` as follows:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: fluentd    <-- this line changed from vector to fluentd
However, the fix for bug LOG-4086 never deleted the new line introduced for the new style of collector definition, and this has important consequences for all customers who:
- installed a new 5.7.0 using the official documentation to define the resources, tolerations and nodeSelector
- upgraded from previous versions passing through 5.7.0
- applied the workaround for LOG-3049
- hit any other reason/bug where spec.collection.type was set, but the resources, nodeSelector or tolerations definitions remain under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector
The consequence is that, with the old style and the new style present together, none of the resources, tolerations and nodeSelectors are applied. This goes unnoticed until the end user detects (a quick way to check for this is sketched after this list):
- collectors restarting because they do not have enough resources, or reading/sending logs slowly
- missing logs for entire nodes, because the tolerations are no longer in the collector definition and therefore no collectors run on nodes where custom taints were defined
- collector pods running on undesired nodes: some customers want to collect logs only from some nodes, not from all of them
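A quick way to check whether an instance is affected is to compare the collection section of the clusterlogging instance with what the collector daemonset actually received. This is a minimal sketch; the resource name `instance` and the namespace `openshift-logging` are the defaults used elsewhere in this report and may differ in a given cluster:
# Show the collection section of the ClusterLogging CR; if both
# spec.collection.type and spec.collection.logs.fluentd.* are present,
# the fluentd resources/tolerations/nodeSelector are being ignored.
$ oc get clusterlogging instance -n openshift-logging -o yaml | grep -A 30 'collection:'
# Compare with what the collector daemonset actually got:
$ oc get ds collector -n openshift-logging -o jsonpath='{.spec.template.spec.containers[0].resources}'
$ oc get ds collector -n openshift-logging -o jsonpath='{.spec.template.spec.tolerations}'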
Version-Release number of selected component (if applicable):
RHOL 5.5, 5.6 and 5.7.
How reproducible:
Always
Steps to Reproduce:
1. Install a version of RHOL prior to 5.7.0
2. Define resources, tolerations and nodeSelector as documented:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
3. Verify that the daemonset and the fluentd pods have these definitions:
$ oc get ds collector -n openshift-logging -o yaml
...
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
...
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/disk-pressure
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/worker-hp
        operator: Exists
4. Upgrade to RHOL 5.7.0, hitting LOG-4086, and after that upgrade to RHOL 5.7.1; or upgrade directly to a version later than 5.7.0 and edit the clusterlogging instance definition to simulate the behaviour introduced by LOG-4086. This is possible even with the fix, because the fix never deleted the new line introduced with the new style for users who passed through 5.7.0; it only avoided introducing the new line for users upgrading from 5.6 to 5.7.1 or installing 5.7.1 directly. Alternatively, apply the workaround:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: fluentd    <---- add this
Confirm that the collectors no longer have the defined resources, nodeSelectors and tolerations:
$ oc get ds collector -o yaml
...
        resources:    <----------- the values are reset to the defaults, not those defined in the clusterlogging instance
          limits:
            memory: 736Mi
          requests:
            cpu: 100m
            memory: 736Mi
...
      tolerations:    <------- the values are reset to the defaults, not containing those defined in the clusterlogging instance
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/disk-pressure
        operator: Exists
Actual results:
resources, tolerations and nodeSelector are all missing from the collector definition, causing (a node-placement check is sketched after this list):
- performance issues
- collector pods running on nodes where they should not, impacting the performance of the defined LogStores and having an economic impact since more storage is needed
- collector pods not running on nodes with taints, so logs from those nodes are never collected
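To see where the collector pods actually ended up, their placement can be listed per node. This is a minimal sketch; the label selector `component=collector` is an assumption based on the default labels of the collector daemonset and may differ between versions:
# List collector pods together with the node each one is scheduled on
$ oc get pods -n openshift-logging -l component=collector -o wide
# List nodes and their taints; nodes carrying the custom taint should each run a collector pod
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'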
Expected results:
The definition of the resources, tolerations and nodeSelector in the clusterlogging instance is respected, as it worked in the past and as expected per the documentation.
Workaround:
Delete the new line introduced by LOG-4086. Note that the problem does not trigger any alerts, so customers only become aware of it once the issues described in the Actual results section appear; by the time it is reported, the consequences are already there. The workaround should be to change:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: fluentd    <---- needed to delete this line
to:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
Or move the definition of the collector to the new style described in the article: https://access.redhat.com/solutions/6999814 (a sketch of the new style follows).
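For illustration, a sketch of the same settings using the new-style fields directly under spec.collection; this assumes the field layout described in the linked article and should be verified against that article before applying:
spec:
  collection:
    type: fluentd
    resources:
      limits:
        memory: 500Mi
      requests:
        cpu: 200m
        memory: 500Mi
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/worker-hp
      operator: Exists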
- clones
-
LOG-4185 Resources, tolerations and nodeSelector for the collector are missing
- Closed