OpenShift Logging / LOG-3226

FluentdQueueLengthIncreasing rule failing to be evaluated.


    • Before this change, the alert could fail to fire when there was a cardinality issue with the set of labels returned from this alert expression. This fix corrects that by reducing the labels to only include those required for the alert.
    • Log Collection - Sprint 227, Log Collection - Sprint 228
    • Low

      Description of problem:

      The rule FluentdQueueLengthIncreasing is still failing. We had fixed some issues with this rule in

      https://issues.redhat.com/browse/LOG-2640

      The customer, after upgrading to 5.4.3 where that bug is fixed, is still hitting the issue.
      I have checked the rule expression on the customer site:

      expr: |
      ( 0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod,plugin_id) ( deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h]) > 1 )
      for: 1h

      The error seems to be the same, but this time we can see that we are grouping by pod and plugin_id. When I check the error, I see:

      ts=2022-10-19T08:02:27.683Z caller=manager.go:609 level=warn component="rule manager" group=logging_fluentd.alerts msg="Evaluating rule failed" rule="alert: FluentdQueueLengthIncreasing\nexpr: (0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod, plugin_id)\n (deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h])\n > 1)\nfor: 1h\nlabels:\n service: fluentd\n severity: Warning\nannotations:\n message: For the last hour, fluentd {{ $labels.instance }} output '{{ $labels.plugin_id\n }}' average buffer queue length has increased continuously.\n summary: Fluentd is unable to keep up with traffic over time for forwarder output\n {{ $labels.plugin_id }}.\n" err="found duplicate series for the match group

      {plugin_id=\"default\", pod=\"collector-gt6l4\"}

      on the right hand-side of the operation: [

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      ,

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      ];many-to-many matching not allowed: matching labels must be unique on one side"

      The two conflicting series differ only in the instance label (10.129.18.7 vs 10.129.18.6):

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      Should we also group by instance?
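
      One possible direction, sketched here purely as an illustration: since the duplicate series differ only in instance, aggregating each side of the join down to pod and plugin_id would make the match group unique before the on(pod, plugin_id) operation. Note this is an assumption for discussion, not the shipped expression (per the release note, the actual fix reduces the labels to only those required for the alert):

      expr: |
        ( 0 * sum by(pod, plugin_id) (deriv(fluentd_output_status_emit_records[1m] offset 1h)) )
        + on(pod, plugin_id)
        ( sum by(pod, plugin_id) (deriv(fluentd_output_status_buffer_queue_length[10m])) > 0
          and sum by(pod, plugin_id) (delta(fluentd_output_status_buffer_queue_length[1h])) > 1 )
      for: 1h

      Alternatively, adding instance to the on(...) clause, as asked above, would also avoid the duplicate-series error, at the cost of evaluating the alert per collector endpoint rather than per pod.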

      Thanks in advance.

      Version-Release number of selected component (if applicable): 5.4.3

              jcantril@redhat.com Jeffrey Cantrill
              rhn-support-gparente German Parente
              Kabir Bharti
              Votes: 1
              Watchers: 5
