-
Bug
-
Resolution: Obsolete
-
Normal
-
None
-
Logging 5.1, Logging 5.3.0
-
False
-
None
-
False
-
NEW
-
NEW
-
Bug Fix
1. Bug Overview:
a) Description of bug report:
[RHOCP4.8.17] Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate are fired even when fluentd is outputting errors continuously.
b) Bug Description:
Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate are fired even when fluentd is outputting errors continuously.
Are these alerts' thresholds correct?
2. Bug Details:
a) Architectures: 64-bit Intel EM64T/AMD64
x86_64
b) Bugzilla Dependencies:
c) Drivers or hardware dependencies:
d) Upstream acceptance information:
e) External links:
f) Severity (H,M,L):
M
g) How reproducible:
Always
h) Steps to Reproduce:
1. Configure ClusterLogForwarder to forward logs to a nonexistent receiver (e.g. an external Elasticsearch instance).
This causes continuous errors in fluentd.
2. Wait for an hour.
i) Actual results:
Neither FluentDHighErrorRate nor FluentDVeryHighErrorRate are fired even when fluentd is outputting errors continuously.
In our environment, the value of the following expression does not exceed 6 even while fluentd is outputting errors continuously.
Because it never reaches the FluentDHighErrorRate threshold (10), FluentDHighErrorRate is never fired.
100 * (sum by(instance) (rate(fluentd_output_status_num_errors[2m])) / sum by(instance) (rate(fluentd_output_status_emit_records[2m])))
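The arithmetic behind the expression above can be sketched as follows. This is only an illustration: the thresholds (10 and 25) come from the alert settings in this report, while the function names and the per-second rates in the example are hypothetical and were not measured on the affected cluster.

```python
import math

# Thresholds as given in the alert settings of this report.
THRESHOLDS = {
    "FluentDHighErrorRate": 10.0,
    "FluentDVeryHighErrorRate": 25.0,
}


def error_rate_percent(errors_per_sec: float, records_per_sec: float) -> float:
    """Mirror 100 * (error rate / emit rate) for a single instance."""
    if records_per_sec == 0:
        # Floating-point division as in PromQL: 0/0 is NaN, x/0 is +Inf.
        return math.nan if errors_per_sec == 0 else math.inf
    return 100.0 * errors_per_sec / records_per_sec


def exceeded_alerts(percent: float) -> list:
    """Return the alert names whose threshold the given percentage exceeds."""
    # Comparisons with NaN are always False, so NaN exceeds nothing.
    return [name for name, limit in THRESHOLDS.items() if percent > limit]


# Hypothetical rates: 0.375 errors/s against 8 emitted records/s.
percent = error_rate_percent(0.375, 8.0)
print(percent, exceeded_alerts(percent))
```

With these assumed rates the ratio stays below 10, so neither threshold is exceeded, which matches the behavior described in this report.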
j) Expected results:
FluentDHighErrorRate and FluentDVeryHighErrorRate are fired when fluentd is outputting errors continuously.
k) Additional information
Alert Settings
name: FluentDHighErrorRate
  expr: 100 * (sum by(instance) (rate(fluentd_output_status_num_errors[2m])) / sum by(instance) (rate(fluentd_output_status_emit_records[2m]))) > 10
  for: 15m
  labels:
    severity: warning
  annotations:
    message: {{ $value }}% of records have resulted in an error by fluentd {{ $labels.instance }}.
    summary: FluentD output errors are high
name: FluentDVeryHighErrorRate
  expr: 100 * (sum by(instance) (rate(fluentd_output_status_num_errors[2m])) / sum by(instance) (rate(fluentd_output_status_emit_records[2m]))) > 25
  for: 15m
  labels:
    severity: critical
  annotations:
    message: {{ $value }}% of records have resulted in an error by fluentd {{ $labels.instance }}.
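The effect of the "for: 15m" clause in the settings above can be sketched as below. This is a simplified model, not Prometheus code: it assumes the expression is evaluated at fixed intervals and that the alert fires only once the condition has held without interruption for the whole duration. The function name, the 60-second interval, and the sample values are hypothetical.

```python
from typing import Iterable, Optional, Tuple

FOR_SECONDS = 15 * 60  # "for: 15m" from the alert settings


def would_fire(samples: Iterable[Tuple[int, float]],
               threshold: float,
               for_seconds: int = FOR_SECONDS) -> bool:
    """Return True if value > threshold holds continuously for for_seconds.

    samples is an iterable of (timestamp_in_seconds, value) pairs in
    increasing timestamp order, one pair per rule evaluation.
    """
    pending_since: Optional[int] = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition became true: pending
            if timestamp - pending_since >= for_seconds:
                return True  # held long enough: firing
        else:
            pending_since = None  # condition broke: reset the timer
    return False


# Hypothetical series: one evaluation per minute for 20 minutes.
at_six_percent = [(minute * 60, 6.0) for minute in range(21)]
at_twelve_percent = [(minute * 60, 12.0) for minute in range(21)]

print(would_fire(at_six_percent, 10.0))     # below the warning threshold
print(would_fire(at_twelve_percent, 10.0))  # above it for the full 15 minutes
```

Under this model, a value that stays around 6 never enters the pending state for either alert, regardless of how long the errors continue.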
3. Business impact:
Because of this issue, users may not notice fluentd abnormalities from alerts, so they cannot respond early.
Ultimately, this may cause users to lose their logs.