-
Feature Request
-
Resolution: Unresolved
-
Normal
-
None
-
4.16
1. Proposed title of this feature request.
Introduce a new AlertingRule to detect flush failures from the loki-ingester to object storage in Logging 6.
2. What is the nature and description of the request?
New alerting rules, as shown below, should be introduced to detect failures in flushing data from the loki-ingester to object storage.
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: loki-flush-failure
  namespace: openshift-monitoring
spec:
  groups:
    - name: loki-flush-failure
      rules:
        - alert: LokiIngesterFlushGotStuck
          for: 15m
          expr: "sum(loki_ingester_flush_queue_length) > 0"
          labels:
            severity: critical
          annotations:
            message: Loki's flush of ingester data has been stuck for 15m
        - alert: LokiIngesterFlushGotFailed
          for: 1m
          expr: "sum(increase(loki_ingester_chunks_flush_failures_total[10m])) > 0"
          labels:
            severity: warning
          annotations:
            message: Loki failed to flush ingester data to object storage
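For reference, assuming the AlertingRule API (monitoring.openshift.io/v1) provided by the cluster monitoring stack is available on the cluster, the proposed resource could be applied with "oc apply -f loki-flush-failure.yaml" and listed with "oc get alertingrules -n openshift-monitoring"; the file name here is illustrative only.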
3. Why does the customer need this? (List the business requirements here)
Object storage can go down in some cases (due to hardware failure, etc.).
In such a case, Loki fails to flush data, but the loki-ingester pod's status remains "1/1 Running" and no alerts are fired.
As a result, customers are delayed in noticing the failure. This is the point we want to improve.
We know that the loki-ingester can still work even when object storage is down.
It is still able to receive data from the log collector and keep it on the WAL disk, or in memory when the WAL disk is full, until object storage comes back.
This is likely why the pod's status remains "1/1 Running" and no alerts are fired.
However, if object storage remains down, the loki-ingester pod will eventually crash due to OOM, resulting in data loss.
Customers want to prevent this, especially in mission-critical production environments.
It is important to notify customers of these failures early and give them an opportunity to investigate and recover object storage quickly.
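For reference, the situation described above maps directly onto the expressions in the proposed rules and can already be checked manually in the OpenShift console metrics view with queries such as:

  sum(loki_ingester_flush_queue_length)
  sum(increase(loki_ingester_chunks_flush_failures_total[10m]))

A flush queue that stays non-empty, or a failure counter that keeps increasing, is exactly the condition the new alerts are intended to surface automatically.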
4. List any affected packages or components.
loki-operator