OpenShift Request For Enhancement: RFE-6827

Introduce a new AlertRule to detect flush failures from loki-ingester to object storage in Logging 6

    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Normal
    • 4.16
    • Monitoring

      1. Proposed title of this feature request.

      Introduce a new AlertRule to detect flush failures from loki-ingester to object storage in Logging 6.

      2. What is the nature and description of the request?

      New alerts, as shown below, should be introduced to detect failures to flush data from loki-ingester to object storage.

      apiVersion: monitoring.openshift.io/v1
      kind: AlertingRule
      metadata:
        name: loki-flush-failure
        namespace: openshift-monitoring
      spec:
        groups:
        - name: loki-flush-failure
          rules:
          # Fires when the flush queue has stayed non-empty for 15 minutes.
          - alert: LokiIngesterFlushGotStacked
            for: 15m
            expr: "sum(loki_ingester_flush_queue_length) > 0"
            labels:
              severity: critical
            annotations:
              message: Loki's flush of ingester data got stacked for 15m
          # Fires when flush failures were counted in the last 10 minutes
          # and the condition has held for 1 minute.
          - alert: LokiIngesterFlushGotFailed
            for: 1m
            expr: "sum(increase(loki_ingester_chunks_flush_failures_total[10m])) > 0"
            labels:
              severity: warning
            annotations:
              message: Loki failed to flush ingester's data to object storage
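
      One way to check the proposed expressions before rollout is a promtool unit test. The following is only a minimal sketch under stated assumptions: the groups list from the AlertingRule spec above has been copied into a plain Prometheus rule file (the file name loki-flush-failure.rules.yaml is hypothetical), and the sample values are invented for illustration.

      ```yaml
      # Hypothetical rule file holding the "groups:" list from the spec above.
      rule_files:
        - loki-flush-failure.rules.yaml

      evaluation_interval: 1m

      tests:
        - interval: 1m
          input_series:
            # Invented data: the flush queue holds 3 entries for 20 minutes.
            - series: 'loki_ingester_flush_queue_length{pod="ingester-0"}'
              values: '3+0x20'
            # Invented data: flush failures start at the 3-minute mark.
            - series: 'loki_ingester_chunks_flush_failures_total{pod="ingester-0"}'
              values: '0 0 0 1 2 3 4 5 6 7 8'

          alert_rule_test:
            # Still inside the 15m "for" window, so nothing should fire yet.
            - eval_time: 10m
              alertname: LokiIngesterFlushGotStacked

            # Queue has been non-empty for more than 15m.
            - eval_time: 20m
              alertname: LokiIngesterFlushGotStacked
              exp_alerts:
                - exp_labels:
                    severity: critical
                  exp_annotations:
                    message: "Loki's flush of ingester data got stacked for 15m"

            # No failures counted yet.
            - eval_time: 2m
              alertname: LokiIngesterFlushGotFailed

            # Failures have been increasing for several minutes.
            - eval_time: 10m
              alertname: LokiIngesterFlushGotFailed
              exp_alerts:
                - exp_labels:
                    severity: warning
                  exp_annotations:
                    message: "Loki failed to flush ingester's data to object storage"
      ```

      If the sketch is saved as, say, loki-flush-failure.test.yaml (again a hypothetical name), it can be run with "promtool test rules loki-flush-failure.test.yaml". Because both expressions aggregate with sum(), the resulting alerts carry only the severity label, which is why no pod label is expected.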

      3. Why does the customer need this? (List the business requirements here)

      Object storage can become unavailable, for example because of a hardware failure.
      When that happens, Loki fails to flush data, but the loki-ingester pod status remains "1/1 Running" and no alerts fire.
      As a result, customers are slow to notice the failure. This is what we want to improve.

      We know that loki-ingester can still work while object storage is down.
      It continues to receive data from the log collector and keeps it on the WAL disk, or in memory when the WAL disk is full, until object storage comes back.
      This is likely why the pod status remains "1/1 Running" and no alerts fire.

      However, if object storage stays down, the loki-ingester pod eventually crashes due to OOM, resulting in data loss.
      Customers want to prevent this, especially in mission-critical production environments.
      It is important to notify customers of failures early so that they can investigate and recover object storage quickly.

      4. List any affected packages or components.

      loki-operator

              jamparke@redhat.com Jamie Parker
              rhn-support-dsrivast Divyanshi Srivastava