Type: Feature
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: Logging 6.0, Logging 6.1, Logging 6.2, Logging 6.3
Component/s: Log Storage
Labels:

Activity Type: Product / Portfolio Work
Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected
PM Score: 0

1. Proposed title of this feature request.

Introduce new alerting rules in Logging 6 to detect failures when loki-ingester flushes data to object storage.

2. What is the nature and description of the request?

New alerting rules, such as the ones shown below, should be introduced to detect failures when loki-ingester flushes data to object storage.

apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: loki-flush-failure
  namespace: openshift-monitoring
spec:
  groups:
  - name: loki-flush-failure
    rules:
    - alert: LokiIngesterFlushGotStacked
      for: 15m
      expr: "sum(loki_ingester_flush_queue_length) > 0"
      labels:
        severity: critical
      annotations:
        message: Loki's flush of ingester data got stacked for 15m
    - alert: LokiIngesterFlushGotFailed
      for: 1m
      expr: "sum(increase(loki_ingester_chunks_flush_failures_total[10m])) > 0"
      labels:
        severity: warning
      annotations:
        message: Loki failed to flush ingester's data to object storage
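
One way to sanity-check the proposed rules before shipping them is a promtool unit test. The sketch below assumes the `spec.groups` content above has been copied into a plain Prometheus rule file (with `groups:` at the top level). The file name `loki-flush-failure.rules.yaml`, the `pod` label value, and the sample values are hypothetical and only serve as an illustration.

```yaml
# Hypothetical promtool unit test for the two proposed alerts.
rule_files:
  - loki-flush-failure.rules.yaml   # assumed file holding the groups from spec.groups
evaluation_interval: 1m
tests:
  # Flush queue stays non-empty for more than 15 minutes.
  - interval: 1m
    input_series:
      - series: 'loki_ingester_flush_queue_length{pod="ingester-0"}'   # hypothetical pod label
        values: '3+0x20'   # constant queue length of 3 from 0m to 20m
    alert_rule_test:
      # Still pending at 10m, so no firing alert is expected.
      - eval_time: 10m
        alertname: LokiIngesterFlushGotStacked
        exp_alerts: []
      # The "for: 15m" window has elapsed, so the alert should be firing.
      - eval_time: 16m
        alertname: LokiIngesterFlushGotStacked
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              message: "Loki's flush of ingester data got stacked for 15m"
  # Flush failure counter starts increasing at 6m.
  - interval: 1m
    input_series:
      - series: 'loki_ingester_chunks_flush_failures_total{pod="ingester-0"}'   # hypothetical pod label
        values: '0+0x5 1+1x10'   # 0 until 5m, then 1, 2, ... up to 16m
    alert_rule_test:
      # Failures have been increasing for more than the 1m "for" duration.
      - eval_time: 10m
        alertname: LokiIngesterFlushGotFailed
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              message: "Loki failed to flush ingester's data to object storage"
```

If these assumptions hold, the test can be run with `promtool test rules` followed by the test file name. Because both expressions use `sum()` without a `by` clause, the resulting alerts carry no per-pod labels, which is why only `severity` appears under `exp_labels`.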

3. Why does the customer need this? (List the business requirements here)

Object storage can become unavailable, for example because of a hardware failure.
When that happens, Loki fails to flush data, but the loki-ingester pod status remains "1/1 Running" and no alerts fire.
As a result, customers are slow to notice the failure. This is what we want to improve.

We understand that loki-ingester can keep working while object storage is down.
It continues to receive data from log-collector and keeps it on the WAL disk, or in memory when the WAL disk is full, until object storage comes back.
This is likely why the pod status remains "1/1 Running" and no alerts fire.

However, if object storage stays down, the loki-ingester pod will eventually crash due to OOM, resulting in data loss.
Customers want to prevent this, especially in mission-critical production environments.
It is important to notify customers of failures early so that they can investigate and recover object storage quickly.

4. List any affected packages or components.

loki-operator

Assignee: Jamie Parker
Reporter: Divyanshi Srivastava
Votes: 2
Watchers: 4
Due: 2024/12/06
Created: 2024/12/06 5:22 AM
Updated: 2025/10/10 9:28 AM
