-
Bug
-
Resolution: Won't Do
-
Critical
-
None
-
Logging 5.6.z
-
False
-
None
-
False
-
NEW
-
NEW
-
-
Bug Fix
-
Proposed
-
-
-
Critical
Description of problem:
If a user has defined spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector and spec.collection.type is also present, then the values under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector are never considered. This can be caused by:
- bug LOG-4086
- the workaround applied for RHOL 5.5.3 in LOG-3049
- any other reason where spec.collection.type was set, but the real definition of the collector remains under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector
As an example, before bug LOG-4086, the collector resources, tolerations and nodeSelector were defined as documented:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
The collector pods were running with these definitions applied.
With bug LOG-4086, as a consequence of the introduction of a new style that was not yet documented in the examples (reflected in bug OBSDOCS-79), a new line was introduced automatically for all users installing a new 5.7.0 as documented or upgrading from previous RHOL versions. This line is:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: vector    <-- this line referring to the new style was introduced, causing the crashlooping of the collectors
To fix the crashlooping of the collectors, the workaround shared was to change `type: vector` to `type: fluentd` as follows:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: fluentd    <-- this line changed from vector to fluentd
However, the fix for bug LOG-4086 never deleted the new line introduced for the new style of collector definition, and this has important consequences for all customers who:
- installed a new 5.7.0 using the official documentation to define the resources, tolerations and nodeSelector
- upgraded from previous versions passing through 5.7.0
- applied the workaround for LOG-3049
- hit any other reason/bug where spec.collection.type was set, but the resources, nodeSelector or tolerations definitions remain under spec.collection.logs.fluentd.resources, spec.collection.logs.fluentd.tolerations and/or spec.collection.logs.fluentd.nodeSelector
The consequence is that, with the old style and the new style present together, none of the resources, tolerations and nodeSelectors are applied. This goes unnoticed until the end user detects (a quick way to check for this is sketched after this list):
- collectors restarting because they do not have enough resources, or reading/sending logs slowly
- missing logs for entire nodes, because the tolerations are no longer in the collector definition and therefore no collectors run on nodes where custom taints were defined
- collector pods running on undesired nodes: some customers want to collect logs only from some nodes, not from all of them
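A quick way to check whether an instance is affected is to compare the collection section of the clusterlogging instance with what the collector daemonset actually received. This is a minimal sketch; the resource name `instance` and the namespace `openshift-logging` are the defaults used elsewhere in this report and may differ in a given cluster:
# Show the collection section of the ClusterLogging CR; if both
# spec.collection.type and spec.collection.logs.fluentd.* are present,
# the fluentd resources/tolerations/nodeSelector are being ignored.
$ oc get clusterlogging instance -n openshift-logging -o yaml | grep -A 30 'collection:'
# Compare with what the collector daemonset actually got:
$ oc get ds collector -n openshift-logging -o jsonpath='{.spec.template.spec.containers[0].resources}'
$ oc get ds collector -n openshift-logging -o jsonpath='{.spec.template.spec.tolerations}'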
Version-Release number of selected component (if applicable):
RHOL 5.5, 5.6 and 5.7.
How reproducible:
Always
Steps to Reproduce:
1. Install a version of RHOL prior to 5.7.0
2. Define resources, tolerations and nodeSelector as documented:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
3. Verify that the daemonset and the fluentd pods have these definitions:
$ oc get ds collector -n openshift-logging -o yaml
...
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
...
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/disk-pressure
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/worker-hp
        operator: Exists
4. Upgrade to RHOL 5.7.0, hitting LOG-4086, and after that upgrade to RHOL 5.7.1; or upgrade directly to a version later than 5.7.0 and edit the clusterlogging instance definition to simulate the behaviour introduced by LOG-4086. This is possible even with the fix, because the fix never deleted the new line introduced with the new style for users who passed through 5.7.0; it only avoided introducing the new line for users upgrading from 5.6 to 5.7.1 or installing 5.7.1 directly. Alternatively, apply the workaround:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: fluentd    <---- add this
Confirm that the collectors no longer have the defined resources, nodeSelectors and tolerations:
$ oc get ds collector -o yaml
...
        resources:    <----------- the values are reset to the defaults, not those defined in the clusterlogging instance
          limits:
            memory: 736Mi
          requests:
            cpu: 100m
            memory: 736Mi
...
      tolerations:    <------- the values are reset to the defaults, not containing those defined in the clusterlogging instance
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/disk-pressure
        operator: Exists
Actual results:
resources, tolerations and nodeSelector are all missing from the collector definition, causing (a node-placement check is sketched after this list):
- performance issues
- collector pods running on nodes where they should not, impacting the performance of the defined LogStores and having an economic impact since more storage is needed
- collector pods not running on nodes with taints, so logs from those nodes are never collected
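To see where the collector pods actually ended up, their placement can be listed per node. This is a minimal sketch; the label selector `component=collector` is an assumption based on the default labels of the collector daemonset and may differ between versions:
# List collector pods together with the node each one is scheduled on
$ oc get pods -n openshift-logging -l component=collector -o wide
# List nodes and their taints; nodes carrying the custom taint should each run a collector pod
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'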
Expected results:
The definition of the resources, tolerations and nodeSelector in the clusterlogging instance is respected, as it worked in the past and as expected per the documentation.
Workaround:
Delete the new line introduced by LOG-4086. Note that the problem does not trigger any alerts, so customers only become aware of it once the issues described in the Actual results section appear; by the time it is reported, the consequences are already there. The workaround should be to change:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
    type: fluentd    <---- needed to delete this line
to:
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 500Mi
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker-hp
          operator: Exists
      type: fluentd
Or move the definition of the collector to the new style described in the article: https://access.redhat.com/solutions/6999814 (a sketch of the new style follows).
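For illustration, a sketch of the same settings using the new-style fields directly under spec.collection; this assumes the field layout described in the linked article and should be verified against that article before applying:
spec:
  collection:
    type: fluentd
    resources:
      limits:
        memory: 500Mi
      requests:
        cpu: 200m
        memory: 500Mi
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/worker-hp
      operator: Exists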
- clones
-
LOG-4185 Resources, tolerations and nodeSelector for the collector are missing
- Closed