Project: OpenShift Logging
Issue: LOG-2457

Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate fires even when fluentd is outputting errors continuously


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal
    • None
    • Logging 5.1, Logging 5.3.0
    • Log Collection
    • False
    • None
    • False
    • NEW
    • NEW
    • Bug Fix

      1. Bug Overview:
      a) Description of bug report:

      [RHOCP4.8.17] Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate fires even when fluentd is outputting errors continuously.

      b) Bug Description:

      Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate fires even when fluentd is outputting errors continuously.
      Are these alerts' thresholds correct?

      2. Bug Details:

      a) Architectures: 64-bit Intel EM64T/AMD64
      x86_64

      b) Bugzilla Dependencies:

      c) Drivers or hardware dependencies:

      d) Upstream acceptance information:

      e) External links:

      f) Severity (H,M,L):
      M

      g) How reproducible:
      Always

      h) Steps to Reproduce:

      1. Configure ClusterLogForwarder to forward logs to a nonexistent receiver (e.g., an external Elasticsearch).
      This causes continuous errors in fluentd.
      2. Wait for an hour.
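      A ClusterLogForwarder that reproduces this only needs an output pointing at an unreachable endpoint. The sketch below assumes the logging.openshift.io/v1 API; the output name, pipeline name, and URL are illustrative placeholders, not taken from the original report:

      ```yaml
      apiVersion: logging.openshift.io/v1
      kind: ClusterLogForwarder
      metadata:
        name: instance
        namespace: openshift-logging
      spec:
        outputs:
          - name: broken-es                             # illustrative name
            type: elasticsearch
            url: http://nonexistent-es.example.com:9200 # unreachable on purpose
        pipelines:
          - name: all-to-broken-es                      # illustrative name
            inputRefs:
              - application
            outputRefs:
              - broken-es
      ```

      With this applied, fluentd keeps retrying flushes against the dead endpoint, incrementing fluentd_output_status_num_errors continuously.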

      i) Actual results:

      Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate fires even when fluentd is outputting errors continuously.

      In our environment, the following expression does not exceed 6 even while fluentd is outputting errors continuously,
      so it can never reach the FluentDHighErrorRate threshold (10) and the alert never fires.

      100 * (sum by(instance) (rate(fluentd_output_status_num_errors[2m])) / sum by(instance) (rate(fluentd_output_status_emit_records[2m])))
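      One way to see why the expression stays low: fluentd_output_status_num_errors counts failed flush attempts, while fluentd_output_status_emit_records counts individual records, so a steady stream of retry errors against chunks of many records still yields a small percentage. The sketch below mirrors the expression's arithmetic with illustrative rates, not values measured on the reporter's cluster:

      ```python
      def error_rate_percent(errors_per_sec: float, records_per_sec: float) -> float:
          """Mirror of the alert expression:
          100 * rate(fluentd_output_status_num_errors[2m])
              / rate(fluentd_output_status_emit_records[2m])
          """
          return 100 * errors_per_sec / records_per_sec

      # One failed flush retry every 10 seconds against 20 records/s emitted
      # (illustrative rates) stays far below the 10% alert threshold:
      print(f"{error_rate_percent(0.1, 20.0):.2f}%")  # 0.50%
      ```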

      j) Expected results:

      FluentDHighErrorRate and FluentDVeryHighErrorRate should fire when fluentd is outputting errors continuously.

      k) Additional information

      Alert Settings

      name: FluentDHighErrorRate
      expr: 100 * (sum by(instance) (rate(fluentd_output_status_num_errors[2m])) / sum by(instance) (rate(fluentd_output_status_emit_records[2m]))) > 10
      for: 15m
      labels:
        severity: warning
      annotations:
        message: {{ $value }}% of records have resulted in an error by fluentd {{ $labels.instance }}.
        summary: FluentD output errors are high

      name: FluentDVeryHighErrorRate
      expr: 100 * (sum by(instance) (rate(fluentd_output_status_num_errors[2m])) / sum by(instance) (rate(fluentd_output_status_emit_records[2m]))) > 25
      for: 15m
      labels:
        severity: critical
      annotations:
        message: {{ $value }}% of records have resulted in an error by fluentd {{ $labels.instance }}.

      3. Business impact:

      Because of this issue, users may not notice fluentd abnormalities from alerts,
      so they cannot respond early. Ultimately, this may cause users to lose their logs.

              cahartma@redhat.com Casey Hartman
              rhn-support-mfuruta Masaki Furuta