-
Bug
-
Resolution: Done
-
Normal
-
Logging 5.4.z
-
False
-
None
-
False
-
NEW
-
VERIFIED
-
Before this change, the alert could fail to fire when there was a cardinality issue with the set of labels returned from this alert expression. This fix corrects that by reducing the labels to only include those required for the alert.
-
Log Collection - Sprint 227
-
Low
Description of problem:
The rule FluentdQueueLengthIncreasing is still failing to evaluate. Some issues with this rule were already fixed in
https://issues.redhat.com/browse/LOG-2640
The customer upgraded to 5.4.3, which contains that fix, and still has the issue.
I checked the rule expression at the customer site:
expr: |
( 0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod,plugin_id) ( deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h]) > 1 )
for: 1h
The error seems to be the same as before, but this time the expression is grouping by pod and plugin_id. The rule manager logs the following error:
ts=2022-10-19T08:02:27.683Z caller=manager.go:609 level=warn component="rule manager" group=logging_fluentd.alerts msg="Evaluating rule failed" rule="alert: FluentdQueueLengthIncreasing\nexpr: (0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod, plugin_id)\n (deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h])\n > 1)\nfor: 1h\nlabels:\n service: fluentd\n severity: Warning\nannotations:\n message: For the last hour, fluentd {{ $labels.instance }} output '{{ $labels.plugin_id\n }}' average buffer queue length has increased continuously.\n summary: Fluentd is unable to keep up with traffic over time for forwarder output\n {{ $labels.plugin_id }}.\n" err="found duplicate series for the match group
{plugin_id=\"default\", pod=\"collector-gt6l4\"}on the right hand-side of the operation: [
{container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"},
{container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}];many-to-many matching not allowed: matching labels must be unique on one side"
The two series are identical except for the instance label:
{container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}
{container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}
Should we also group by instance?
Thanks in advance.
Version-Release number of selected component (if applicable): 5.4.3
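To illustrate why the evaluation fails, here is a minimal Python sketch (not Prometheus code) that buckets the two right-hand-side series from the error by the labels named in on(...). The helper name duplicate_match_groups is hypothetical and only models the uniqueness check described in the error message; the label sets are copied from the log above.

```python
from collections import defaultdict

# Labels shared by both series reported in the error message.
COMMON = {
    "container": "collector", "endpoint": "metrics",
    "hostname": "collector-gt6l4", "job": "collector",
    "namespace": "openshift-logging", "plugin_id": "default",
    "pod": "collector-gt6l4", "service": "collector",
    "type": "elasticsearch",
}

# The two series differ only in their instance label.
SERIES = [
    {**COMMON, "instance": "10.129.18.7:24231"},
    {**COMMON, "instance": "10.129.18.6:24231"},
]


def duplicate_match_groups(series, on_labels):
    """Return the match groups that contain more than one series.

    Hypothetical helper: it mimics the rule that, for one-to-one vector
    matching, the labels listed in on(...) must identify a single series
    on each side of the operation.
    """
    groups = defaultdict(list)
    for labels in series:
        key = tuple(labels.get(name, "") for name in on_labels)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Matching on (pod, plugin_id) puts both series into the same group.
print(sorted(duplicate_match_groups(SERIES, ("pod", "plugin_id"))))
# Adding instance makes every key unique, so no duplicates remain.
print(duplicate_match_groups(SERIES, ("pod", "plugin_id", "instance")))
```

In this sketch, adding instance to the matching labels separates the two series. Whether that is the right change for the real rule is a separate question: the left-hand side of the expression would also need matching instance values, and the release note above describes a different approach, reducing the labels to only those the alert requires.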
- clones
-
LOG-3226 FluentdQueueLengthIncreasing rule failing to be evaluated.
- Closed