-
Bug
-
Resolution: Done
-
Blocker
-
Logging 5.2.z
-
False
-
None
-
False
-
NEW
-
VERIFIED
-
Before this update, clusters configured to perform CloudWatch forwarding wrote rejected log files to temporary storage, causing cluster instability over time. With this update, chunk backup for CloudWatch has been disabled, resolving the issue.
-
Logging (Core) - Sprint 219, Logging (Core) - Sprint 220
-
Critical
On several clusters configured to perform CloudWatch forwarding, the following condition has been observed in collector containers:
Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError)
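The 262144-byte limit in the message above is CloudWatch Logs' documented maximum size for a single log event, 256 KiB, which the fluent-plugin-cloudwatch-logs output enforces by raising TooLargeEventError. A quick arithmetic check of the observed event against that limit (sizes taken from the error message above):

```python
# CloudWatch Logs caps a single log event at 256 KiB (262,144 bytes);
# the CloudWatch output plugin rejects anything larger with TooLargeEventError.
MAX_EVENT_SIZE = 256 * 1024   # 262144 bytes, matching the limit in the error
event_size = 301_486          # size reported for the rejected event

print(MAX_EVENT_SIZE)              # 262144
print(event_size - MAX_EVENT_SIZE) # 39342 bytes over the limit
```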
The rejected logs are written to tmpfs on the node running the collector pod:
2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log
Over a sustained period, these logs eventually fill the available tmpfs space on the nodes; because tmpfs is backed by RAM, this leads to memory exhaustion. If this occurs on control plane nodes, it eventually destabilizes the cluster.
On the two clusters where we have observed this, it was the cluster audit logs that triggered the 'too large' warning.
Filing this Jira on request, per Slack thread [0]; cc rhn-engineering-aconway
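The fix described in the release note ("chunk backup for CloudWatch has been disabled") can be expressed in fluentd configuration. A minimal sketch, assuming a fluentd version whose buffer section supports the disable_chunk_backup parameter (added in fluentd v1.15); the match pattern, log group/stream names, and buffer path are illustrative placeholders, not the actual operator-managed config:

```
# Sketch only: stops fluentd from writing rejected ("bad") chunks to its
# backup directory (/tmp/fluent/backup/... in the warning above), so
# oversized events are dropped instead of accumulating on tmpfs.
<match kubernetes.**>                 # match pattern is illustrative
  @type cloudwatch_logs               # fluent-plugin-cloudwatch-logs output
  log_group_name my-log-group         # placeholder
  log_stream_name my-log-stream       # placeholder
  <buffer>
    @type file
    path /var/lib/fluentd/cloudwatch  # placeholder buffer path
    disable_chunk_backup true         # do not back up unrecoverable chunks
  </buffer>
</match>
```

The trade-off is that rejected events are lost entirely rather than preserved for inspection, which is acceptable here since the backed-up chunks were never going to be re-deliverable (they exceed a hard CloudWatch limit) and their accumulation was destabilizing nodes.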