- Bug
- Resolution: Won't Do
- Normal
- Logging 5.4.0
- False
- None
- False
- NEW
- NEW
- Bug Fix
Symptoms
In an SNO cluster with no particular extra load, the forwarder sometimes gets stuck in a loop with "buffer too big" and/or "rate exceeded" errors.
This appears to happen when the collector starts and tries to push the (possibly large) backlog already on disk as quickly as possible to catch up to the latest logs. The average log rate of the cluster under test definitely does not exceed the ingest rate.
Reproduce
Fluentd
Not 100% reproducible, but it happens frequently (about 1 in 3 tries).
In an SNO (or other) cluster, run the Loki e2e test:
cd test/e2e/logforwarding/Loki
go test -run Fluentd
Result: test hangs, collector log shows errors like these:
2022-03-22 14:14:15 +0000 [warn]: [loki_receiver] failed to write post to http://loki-receiver.test-d46befyo.svc.cluster.local:3100/loki/api/v1/push (429 Too Many Requests Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '3008' lines totaling '5063913' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased )
2022-03-22 14:14:15 +0000 [warn]: [loki_receiver] failed to flush the buffer. retry_times=1 next_retry_time=2022-03-22 14:14:16 +0000 chunk="5dacf198837beb1be8fc78fef6b51f15" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '3008' lines totaling '5063913' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"
2022-03-22 14:14:15 +0000 [warn]: suppressed same stacktrace
2022-03-22 14:14:16 +0000 [warn]: [loki_receiver] failed to write post to http://loki-receiver.test-d46befyo.svc.cluster.local:3100/loki/api/v1/push (500 Internal Server Error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5123331 vs. 4194304) )
2022-03-22 14:14:16 +0000 [warn]: [loki_receiver] failed to flush the buffer. retry_times=2 next_retry_time=2022-03-22 14:14:19 +0000 chunk="5dacf198837beb1be8fc78fef6b51f15" error_class=Fluent::Plugin::LokiOutput::LogPostError error="500 Internal Server Error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5123331 vs. 4194304)\n"
2022-03-22 14:14:16 +0000 [warn]: suppressed same stacktrace
Vector
Not able to reproduce with Vector; it possibly has safer defaults for the relevant configuration settings.
Fix
There are two problems in the logs.
- Max buffer size too big (to be fixed by this JIRA)
- Ingestion rate exceeded (rate limiting; a separate JIRA will be opened)
To address the buffer size:
- Reduce the default max buffer size to < 4 MB (see the sketch after this list). This could be done just for Loki outputs, but it would probably be safe and reasonable to do globally for the fluentd collector.
- Introduce configuration to control the max buffer size, but only if we decide not to introduce rate limiting; otherwise the max buffer can be computed from the rate limit.
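A minimal sketch of what capping the chunk size on the fluentd Loki output could look like. The plugin type, URL, and buffer path below are assumptions; chunk_limit_size is the standard fluentd buffer parameter that bounds the size of each flushed chunk:
<match **>
  @type loki                                  # Loki output plugin (type name assumed)
  url http://loki-receiver.test-d46befyo.svc.cluster.local:3100
  <buffer>
    @type file
    path /var/lib/fluentd/loki_receiver       # illustrative buffer path
    chunk_limit_size 3m                       # keep each chunk under Loki's 4194304-byte gRPC limit
    flush_interval 1s
    retry_max_interval 30s
  </buffer>
</match>
With chunk_limit_size below 4 MB, a single flush can no longer trigger the "received message larger than max" error; the 429 rate-limit errors would still need to be handled separately.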
Notes
From https://grafana.com/docs/loki/latest/configuration/#limits_config
grpc_server_max_recv_msg_size: <int> | default = 4194304 (B)
ingestion_rate_mb: <float> | default = 4 (MB)
ingestion_burst_size_mb: <int> | default = 6 (MB)
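For illustration only, the receiver-side limits could be raised with something like the following (the values are arbitrary; note that grpc_server_max_recv_msg_size sits in the server block rather than limits_config in current Loki versions):
limits_config:
  ingestion_rate_mb: 8                       # default 4
  ingestion_burst_size_mb: 16                # default 6
server:
  grpc_server_max_recv_msg_size: 8388608     # default 4194304 bytes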
TODO: Link to JIRAs for rate limiting