OpenShift Logging / LOG-2399

Loki output exceeds Loki buffer size and rate limits.



    Description

      Symptoms

      In an SNO cluster with no particular extra load, the forwarder sometimes gets stuck in a loop with "buffer too big" and/or "rate exceeded" errors.

      This appears to happen when the collector starts up and tries to push the (possibly large) data already buffered on disk as quickly as possible to catch up to the latest logs. The average log rate of the cluster under test is well below the ingestion rate limit.
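      For example, in the log excerpts under Reproduce below, a single flush attempts to push 3008 lines totalling 5,063,913 bytes in one request, which exceeds both Loki's default 4,194,304-byte maximum gRPC message size and its 4,194,304 bytes/sec ingestion rate limit.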

      Reproduce

      Fluentd

      Not 100% reproducible, but it happens frequently (about 1 in 3 tries).

      In an SNO (or other) cluster, run the Loki e2e test:

      cd test/e2e/logforwarding/Loki
      go test -run Fluentd 

      Result: test hangs, collector log shows errors like these:

      2022-03-22 14:14:15 +0000 [warn]: [loki_receiver] failed to write post to http://loki-receiver.test-d46befyo.svc.cluster.local:3100/loki/api/v1/push (429 Too Many Requests Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '3008' lines totaling '5063913' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased
      )
      2022-03-22 14:14:15 +0000 [warn]: [loki_receiver] failed to flush the buffer. retry_times=1 next_retry_time=2022-03-22 14:14:16 +0000 chunk="5dacf198837beb1be8fc78fef6b51f15" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '3008' lines totaling '5063913' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"
        2022-03-22 14:14:15 +0000 [warn]: suppressed same stacktrace
      2022-03-22 14:14:16 +0000 [warn]: [loki_receiver] failed to write post to http://loki-receiver.test-d46befyo.svc.cluster.local:3100/loki/api/v1/push (500 Internal Server Error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5123331 vs. 4194304)
      )
      2022-03-22 14:14:16 +0000 [warn]: [loki_receiver] failed to flush the buffer. retry_times=2 next_retry_time=2022-03-22 14:14:19 +0000 chunk="5dacf198837beb1be8fc78fef6b51f15" error_class=Fluent::Plugin::LokiOutput::LogPostError error="500 Internal Server Error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5123331 vs. 4194304)\n"
        2022-03-22 14:14:16 +0000 [warn]: suppressed same stacktrace
      

      Vector

      Not able to reproduce with Vector; it possibly has safer defaults for the relevant configuration settings.

      Fix

      The logs show two distinct problems:

      • Max buffer size too big (to be fixed by this JIRA)
      • Ingestion rate exceeded (rate limiting; a separate JIRA will be opened)

      To address the buffer size:

      1. Reduce the default max buffer size to < 4 MB (see the configuration sketch after this list). This could be done just for the Loki output, but it would probably be safe and reasonable to do globally for the Fluentd collector.
      2. Introduce configuration to control the max buffer size, but only if we decide not to introduce rate limiting; otherwise the max buffer size can be computed from the rate limit.
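      A minimal sketch of what step 1 could look like in the Fluentd configuration, assuming the Loki output uses a standard <buffer> section (the match pattern, buffer path and size values here are illustrative, not the operator's actual generated values):

      <match **>
        @type loki
        # ... endpoint URL, labels, TLS settings as generated by the operator ...
        <buffer>
          @type file
          path /var/lib/fluentd/loki_receiver
          # Keep each flushed chunk well under Loki's default 4 MiB limits
          # (grpc_server_max_recv_msg_size and ingestion_rate_mb).
          chunk_limit_size 2m
          # Cap the on-disk backlog that gets replayed after a restart.
          total_limit_size 32m
          retry_type exponential_backoff
          overflow_action block
        </buffer>
      </match>

      Even with chunk_limit_size below 4 MiB, a startup catch-up flush can still be throttled by the 4 MB/sec ingestion rate limit, which is why rate limiting is tracked as a separate issue.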

      Notes

      From https://grafana.com/docs/loki/latest/configuration/#limits_config (see the example configuration sketch below):

      grpc_server_max_recv_msg_size: <int> | default = 4194304 (B)
      ingestion_rate_mb: <float> | default = 4  (MB)
      ingestion_burst_size_mb: <int> | default = 6  (MB) 
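      For reference, a sketch of how these limits could be raised on the Loki side (values illustrative; note that in the Loki configuration grpc_server_max_recv_msg_size sits in the server block, while the ingestion limits sit in limits_config):

      server:
        # Maximum gRPC message size the server accepts, in bytes (default 4 MiB).
        grpc_server_max_recv_msg_size: 8388608
      limits_config:
        # Per-tenant ingestion rate and burst size, in MB.
        ingestion_rate_mb: 8
        ingestion_burst_size_mb: 12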

      TODO: Link to JIRAs for rate limiting
