OpenShift Logging / LOG-3226

FluentdQueueLengthIncreasing rule failing to be evaluated.

    • Before this change, the alert could fail to fire when there was a cardinality issue with the set of labels returned from this alert expression. This fix corrects that by reducing the labels to only include those required for the alert.
    • Log Collection - Sprint 227, Log Collection - Sprint 228
    • Low

      Description of problem:

      The rule FluentdQueueLengthIncreasing is still failing to evaluate. We fixed some issues with this rule in

      https://issues.redhat.com/browse/LOG-2640

      The customer, after upgrading to 5.4.3 where that bug is fixed, is still hitting the issue.
      I have checked the rule expression on the customer's cluster:

      expr: |
        (0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h)))
        + on(pod, plugin_id)
        (
          deriv(fluentd_output_status_buffer_queue_length[10m]) > 0
          and delta(fluentd_output_status_buffer_queue_length[1h]) > 1
        )
      for: 1h

      The error looks the same, but this time the expression is already matching on pod and plugin_id. When I check the logs, I see:

      ts=2022-10-19T08:02:27.683Z caller=manager.go:609 level=warn component="rule manager" group=logging_fluentd.alerts msg="Evaluating rule failed" rule="alert: FluentdQueueLengthIncreasing\nexpr: (0 * (deriv(fluentd_output_status_emit_records[1m] offset 1h))) + on(pod, plugin_id)\n (deriv(fluentd_output_status_buffer_queue_length[10m]) > 0 and delta(fluentd_output_status_buffer_queue_length[1h])\n > 1)\nfor: 1h\nlabels:\n service: fluentd\n severity: Warning\nannotations:\n message: For the last hour, fluentd {{ $labels.instance }} output '{{ $labels.plugin_id\n }}' average buffer queue length has increased continuously.\n summary: Fluentd is unable to keep up with traffic over time for forwarder output\n {{ $labels.plugin_id }}.\n" err="found duplicate series for the match group

      {plugin_id=\"default\", pod=\"collector-gt6l4\"}

      on the right hand-side of the operation: [

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      ,

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      ];many-to-many matching not allowed: matching labels must be unique on one side"

      The two series in the duplicate group are identical except for the instance label (10.129.18.7 vs. 10.129.18.6):

      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.7:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}
      {container=\"collector\", endpoint=\"metrics\", hostname=\"collector-gt6l4\", instance=\"10.129.18.6:24231\", job=\"collector\", namespace=\"openshift-logging\", plugin_id=\"default\", pod=\"collector-gt6l4\", service=\"collector\", type=\"elasticsearch\"}

      Should we also group by instance?

      Thanks in advance.

      Version-Release number of selected component (if applicable): 5.4.3
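
      The release note above states that the fix reduces the labels to only those required for the alert. A minimal sketch of that approach, assuming the corrected rule aggregates both sides of the match down to pod and plugin_id (the exact expression shipped with the fix may differ), would look like:

      expr: |
        # Assumed sketch: aggregate both sides to (pod, plugin_id) so the on() match is
        # one-to-one even when several instances report for the same pod.
        sum by(pod, plugin_id) (0 * deriv(fluentd_output_status_emit_records[1m] offset 1h))
        + on(pod, plugin_id)
        (
            sum by(pod, plugin_id) (deriv(fluentd_output_status_buffer_queue_length[10m])) > 0
          and
            sum by(pod, plugin_id) (delta(fluentd_output_status_buffer_queue_length[1h])) > 1
        )
      for: 1h

      Either reducing labels as above or adding instance to the match would remove the many-to-many conflict; the release note describes the shipped fix as the label-reduction approach.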


            Kabir Bharti added a comment - verified on cluster-logging.5.5.5

            GitLab CEE Bot added a comment - CPaaS Service Account mentioned this issue in merge request !345 of openshift-logging / Log Collection Midstream on branch openshift-logging-5.4-rhel-8_upstream_238e500b78e2e3e63447671f9691f033:

            Updated US source to: f03574a Merge pull request #1748 from jcantrill/log3252

            GitLab CEE Bot added a comment - CPaaS Service Account mentioned this issue in merge request !338 of openshift-logging / Log Collection Midstream on branch openshift-logging-5.5-rhel-8_upstream_2c420e36986312a1afbf62da3b7942cf:

            Updated US source to: 6b29e76 Merge pull request #1738 from jcantrill/log3226

              Jeffrey Cantrill (jcantril@redhat.com)
              German Parente (rhn-support-gparente)
              Kabir Bharti
              Votes: 1
              Watchers: 5
