OpenShift Logging / LOG-3226

FluentdQueueLengthIncreasing rule failing to be evaluated.


    • Before this change, the alert could fail to fire when there was a cardinality issue with the set of labels returned from this alert expression. This fix corrects that by reducing the labels to only include those required for the alert.
    • Log Collection - Sprint 227, Log Collection - Sprint 228
    • Low

      Description of problem:

      The rule FluentdQueueLengthIncreasing is still failing. We had fixed some issues with this rule in

      https://issues.redhat.com/browse/LOG-2640

      The customer, after upgrading to 5.4.3 where that bug is fixed, is still hitting the issue.
      I have checked the rule expression on the customer site:

      expr: |
      ( 0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod,plugin_id) ( deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h]) > 1 )
      for: 1h

      The error seems to be the same, but this time we can see that we are grouping by pod and plugin_id. When I check the error, I see:

      ts=2022-10-19T08:02:27.683Z caller=manager.go:609 level=warn component="rule manager" group=logging_fluentd.alerts msg="Evaluating rule failed" rule="alert: FluentdQueueLengthIncreasing\nexpr: (0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod, plugin_id)\n (deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h])\n > 1)\nfor: 1h\nlabels:\n service: fluentd\n severity: Warning\nannotations:\n message: For the last hour, fluentd {{ $labels.instance }} output '{{ $labels.plugin_id\n }}' average buffer queue length has increased continuously.\n summary: Fluentd is unable to keep up with traffic over time for forwarder output\n {{ $labels.plugin_id }}.\n" err="found duplicate series for the match group

      {plugin_id=\"default\", pod=\"collector-gt6l4\"}

      on the right hand-side of the operation: [

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      ,

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      ];many-to-many matching not allowed: matching labels must be unique on one side"

      The two conflicting series differ only in the instance label (10.129.18.7 vs 10.129.18.6):

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      Should we also group by instance?
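
      One possible direction, sketched here purely as an illustration: since the duplicate series differ only in instance, aggregating each side of the join down to pod and plugin_id would make the match group unique before the on(pod, plugin_id) operation. Note this is an assumption for discussion, not the shipped expression (per the release note, the actual fix reduces the labels to only those required for the alert):

      expr: |
        ( 0 * sum by(pod, plugin_id) (deriv(fluentd_output_status_emit_records[1m] offset 1h)) )
        + on(pod, plugin_id)
        ( sum by(pod, plugin_id) (deriv(fluentd_output_status_buffer_queue_length[10m])) > 0
          and sum by(pod, plugin_id) (delta(fluentd_output_status_buffer_queue_length[1h])) > 1 )
      for: 1h

      Alternatively, adding instance to the on(...) clause, as asked above, would also avoid the duplicate-series error, at the cost of evaluating the alert per collector endpoint rather than per pod.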

      Thanks in advance.

      Version-Release number of selected component (if applicable): 5.4.3

              jcantril@redhat.com Jeffrey Cantrill
              rhn-support-gparente German Parente
              Kabir Bharti
              Votes: 1
              Watchers: 5
