OpenShift Logging / LOG-2399

Loki output exceeds Loki buffer size and rate limits.



    Description

      Symptoms

      In an SNO cluster with no particular extra load, the forwarder sometimes gets stuck in a loop with "buffer too big" and/or "rate exceeded" errors.

      This appears to happen when the collector starts up and tries to push the (possibly large) data already buffered on disk as quickly as possible to catch up to the latest logs. The average log rate of the cluster under test is well below the ingestion rate limit.
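      For example, in the log excerpts under Reproduce below, a single flush attempts to push 3008 lines totalling 5,063,913 bytes in one request, which exceeds both Loki's default 4,194,304-byte maximum gRPC message size and its 4,194,304 bytes/sec ingestion rate limit.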

      Reproduce

      Fluentd

      Not 100% reproducible, but it happens frequently (about 1 in 3 tries).

      In an SNO (or other) cluster, run the Loki e2e test:

      cd test/e2e/logforwarding/Loki
      go test -run Fluentd 

      Result: test hangs, collector log shows errors like these:

      2022-03-22 14:14:15 +0000 [warn]: [loki_receiver] failed to write post to http://loki-receiver.test-d46befyo.svc.cluster.local:3100/loki/api/v1/push (429 Too Many Requests Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '3008' lines totaling '5063913' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased
      )
      2022-03-22 14:14:15 +0000 [warn]: [loki_receiver] failed to flush the buffer. retry_times=1 next_retry_time=2022-03-22 14:14:16 +0000 chunk="5dacf198837beb1be8fc78fef6b51f15" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '3008' lines totaling '5063913' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"
        2022-03-22 14:14:15 +0000 [warn]: suppressed same stacktrace
      2022-03-22 14:14:16 +0000 [warn]: [loki_receiver] failed to write post to http://loki-receiver.test-d46befyo.svc.cluster.local:3100/loki/api/v1/push (500 Internal Server Error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5123331 vs. 4194304)
      )
      2022-03-22 14:14:16 +0000 [warn]: [loki_receiver] failed to flush the buffer. retry_times=2 next_retry_time=2022-03-22 14:14:19 +0000 chunk="5dacf198837beb1be8fc78fef6b51f15" error_class=Fluent::Plugin::LokiOutput::LogPostError error="500 Internal Server Error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5123331 vs. 4194304)\n"
        2022-03-22 14:14:16 +0000 [warn]: suppressed same stacktrace
      

      Vector

      Not able to reproduce with Vector; it possibly has safer defaults for the relevant configuration settings.

      Fix

      The logs show two distinct problems:

      • Max buffer size too big (to be fixed by this JIRA)
      • Ingestion rate exceeded (rate limiting; a separate JIRA will be opened)

      To address the buffer size:

      1. Reduce the default max buffer size to < 4 MB (see the configuration sketch after this list). This could be done just for the Loki output, but it would probably be safe and reasonable to do globally for the Fluentd collector.
      2. Introduce configuration to control the max buffer size, but only if we decide not to introduce rate limiting; otherwise the max buffer size can be computed from the rate limit.
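      A minimal sketch of what step 1 could look like in the Fluentd configuration, assuming the Loki output uses a standard <buffer> section (the match pattern, buffer path and size values here are illustrative, not the operator's actual generated values):

      <match **>
        @type loki
        # ... endpoint URL, labels, TLS settings as generated by the operator ...
        <buffer>
          @type file
          path /var/lib/fluentd/loki_receiver
          # Keep each flushed chunk well under Loki's default 4 MiB limits
          # (grpc_server_max_recv_msg_size and ingestion_rate_mb).
          chunk_limit_size 2m
          # Cap the on-disk backlog that gets replayed after a restart.
          total_limit_size 32m
          retry_type exponential_backoff
          overflow_action block
        </buffer>
      </match>

      Even with chunk_limit_size below 4 MiB, a startup catch-up flush can still be throttled by the 4 MB/sec ingestion rate limit, which is why rate limiting is tracked as a separate issue.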

      Notes

      From https://grafana.com/docs/loki/latest/configuration/#limits_config (see the example configuration sketch below):

      grpc_server_max_recv_msg_size: <int> | default = 4194304 (B)
      ingestion_rate_mb: <float> | default = 4  (MB)
      ingestion_burst_size_mb: <int> | default = 6  (MB) 
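      For reference, a sketch of how these limits could be raised on the Loki side (values illustrative; note that in the Loki configuration grpc_server_max_recv_msg_size sits in the server block, while the ingestion limits sit in limits_config):

      server:
        # Maximum gRPC message size the server accepts, in bytes (default 4 MiB).
        grpc_server_max_recv_msg_size: 8388608
      limits_config:
        # Per-tenant ingestion rate and burst size, in MB.
        ingestion_rate_mb: 8
        ingestion_burst_size_mb: 12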

      TODO: Link to JIRAs for rate limiting
