-
Bug
-
Resolution: Done
-
Major
-
None
-
False
-
None
-
False
-
NEW
-
VERIFIED
-
* Before this update, clusters configured to forward logs to Amazon CloudWatch wrote log chunks rejected as too large to node-local temporary storage (tmpfs), which filled up over time and destabilized the cluster. With this update, chunk backup for the CloudWatch output is disabled, resolving the issue.
-
Logging (Core) - Sprint 220, Log Collection - Sprint 221
-
Critical
On several clusters configured to perform CloudWatch forwarding, the following condition has been observed in collector containers:
Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError)
The rejected logs are written to tmpfs on the node running the collector pod:
2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log
Over a sustained period, these backup files eventually fill the available tmpfs space on the node, leading to memory exhaustion. When this happens on control plane nodes, it eventually destabilizes the cluster.
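For context, Fluentd moves chunks it cannot process ("bad chunks") to a backup directory under the configured root_dir, which is consistent with the /tmp/fluent/backup/... path in the warning above. A minimal sketch of the assumed system section (the root_dir value is inferred from the warning, not confirmed from the cluster's actual configuration):
<system>
  # Assumed setting: with root_dir /tmp/fluent, unrecoverable chunks are
  # moved to /tmp/fluent/backup/worker<N>/<plugin id>/<chunk id>.log,
  # matching the warning shown above.
  root_dir /tmp/fluent
</system>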
On the two clusters where we have observed this, it was the cluster audit logs that triggered the 'too large' warning.
Creating this Jira as requested in Slack thread [0]; cc rhn-engineering-aconway
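The release note above states that chunk backup for the CloudWatch output has been disabled to resolve this. For a raw Fluentd deployment hitting the same symptom, a sketch of the equivalent mitigation, assuming a Fluentd version that supports the disable_chunk_backup buffer parameter (it discards unrecoverable chunks instead of writing them to the backup directory); the region, log group, and log stream values below are placeholders, not taken from the affected clusters:
<match kubernetes.**>
  @type cloudwatch_logs
  # Placeholder destination settings for illustration only.
  region us-east-1
  log_group_name placeholder-group
  log_stream_name placeholder-stream
  <buffer>
    # Discard unrecoverable chunks (e.g. on TooLargeEventError) instead of
    # backing them up under ${root_dir}/backup, so tmpfs does not fill.
    disable_chunk_backup true
  </buffer>
</match>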
- clones
-
LOG-2635 CloudWatch forwarding rejecting large log events, fills tmpfs
- Closed