OpenShift Logging / LOG-2635

CloudWatch forwarding rejects large log events, filling tmpfs


    • Release Note Text: Before this update, clusters configured to perform CloudWatch forwarding wrote rejected log files to temporary storage, causing cluster instability over time. With this update, chunk backup for CloudWatch has been disabled, resolving the issue.
    • Sprint: Logging (Core) - Sprint 219, Logging (Core) - Sprint 220
    • Severity: Critical

      On several clusters configured to perform CloudWatch forwarding, the following condition has been observed in collector containers:

      Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError) 
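
      For reference, the 262144-byte figure is the maximum size CloudWatch Logs accepts for a single log event, so a forwarder has to drop, truncate, or otherwise handle anything above it. Below is a minimal sketch of such a guard using boto3; the client setup, names, and drop-instead-of-backup policy are illustrative assumptions, not the collector's actual implementation.

      # Sketch only: mirrors the per-event limit behind the TooLargeEventError above.
      import time

      import boto3

      MAX_EVENT_SIZE = 262_144  # CloudWatch Logs per-event limit (256 KiB)

      client = boto3.client("logs", region_name="us-east-1")  # region is an example

      def forward(messages, group, stream):
          """Forward messages to CloudWatch Logs, dropping oversized events."""
          events = []
          for msg in messages:
              size = len(msg.encode("utf-8"))
              if size > MAX_EVENT_SIZE:
                  # The case the collector reports as TooLargeEventError;
                  # here the event is dropped rather than backed up to disk.
                  print(f"discarding {size}-byte event: exceeds {MAX_EVENT_SIZE}")
                  continue
              events.append({"timestamp": int(time.time() * 1000), "message": msg})
          if events:
              # Sequence-token handling and batch-size limits are omitted.
              client.put_log_events(logGroupName=group,
                                    logStreamName=stream,
                                    logEvents=events)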

      The rejected logs are written to tmpfs on the node running the collector pod:

      2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log 

      Over a sustained period, these files fill the available tmpfs space on the node, leading to memory exhaustion. When this happens on control plane nodes, it eventually destabilizes the cluster.
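
      To gauge how much temporary storage the rejected chunks are consuming on an affected node, a quick check along the lines of the sketch below can help; the path is taken from the warning above, and the rest is illustrative.

      # Sketch: sum the size of fluentd bad-chunk backups under tmpfs.
      import os

      BACKUP_DIR = "/tmp/fluent/backup"  # path from the fluentd warning above

      def backup_bytes(root=BACKUP_DIR):
          """Return the total size in bytes of backed-up chunk files under root."""
          total = 0
          for dirpath, _dirnames, filenames in os.walk(root):
              for name in filenames:
                  try:
                      total += os.path.getsize(os.path.join(dirpath, name))
                  except OSError:
                      # A chunk may be moved or deleted mid-walk; skip it.
                      continue
          return total

      if __name__ == "__main__":
          print(f"{backup_bytes() / (1024 * 1024):.1f} MiB of rejected chunks under {BACKUP_DIR}")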

      On the two clusters where we have observed this, it was the cluster audit logs that triggered the 'too large' warning.

      Creating this Jira on request per Slack thread [0], cc rhn-engineering-aconway 

      [0] https://coreos.slack.com/archives/CB3HXM2QK/p1652711924055009?thread_ts=1652660899.213729&cid=CB3HXM2QK

              Jeffrey Cantrill (jcantril@redhat.com)
              Matt Bargenquast (mbargenq, inactive)
              Anping Li
              Votes: 0
              Watchers: 14
