-
Bug
-
Resolution: Done
-
Major
-
None
-
False
-
None
-
False
-
NEW
-
VERIFIED
-
* Before this update, clusters configured to forward logs to Amazon CloudWatch wrote log chunks rejected as too large to node-local temporary storage (tmpfs), which filled up over time and destabilized the cluster. With this update, chunk backup for the CloudWatch output is disabled, resolving the issue.
-
Logging (Core) - Sprint 220, Log Collection - Sprint 221
-
Critical
On several clusters configured to perform CloudWatch forwarding, the following condition has been observed in collector containers:
Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError)
The rejected logs are written to tmpfs on the node running the collector pod:
2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log
Over a sustained period, these backup files eventually fill the available tmpfs space on the node, leading to memory exhaustion. When this happens on control plane nodes, it eventually destabilizes the cluster.
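For context, Fluentd moves chunks it cannot process ("bad chunks") to a backup directory under the configured root_dir, which is consistent with the /tmp/fluent/backup/... path in the warning above. A minimal sketch of the assumed system section (the root_dir value is inferred from the warning, not confirmed from the cluster's actual configuration):
<system>
  # Assumed setting: with root_dir /tmp/fluent, unrecoverable chunks are
  # moved to /tmp/fluent/backup/worker<N>/<plugin id>/<chunk id>.log,
  # matching the warning shown above.
  root_dir /tmp/fluent
</system>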
On the two clusters where we have observed this, it was the cluster audit logs that triggered the 'too large' warning.
Creating this Jira as requested in Slack thread [0]; cc rhn-engineering-aconway
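The release note above states that chunk backup for the CloudWatch output has been disabled to resolve this. For a raw Fluentd deployment hitting the same symptom, a sketch of the equivalent mitigation, assuming a Fluentd version that supports the disable_chunk_backup buffer parameter (it discards unrecoverable chunks instead of writing them to the backup directory); the region, log group, and log stream values below are placeholders, not taken from the affected clusters:
<match kubernetes.**>
  @type cloudwatch_logs
  # Placeholder destination settings for illustration only.
  region us-east-1
  log_group_name placeholder-group
  log_stream_name placeholder-stream
  <buffer>
    # Discard unrecoverable chunks (e.g. on TooLargeEventError) instead of
    # backing them up under ${root_dir}/backup, so tmpfs does not fill.
    disable_chunk_backup true
  </buffer>
</match>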
- clones
-
LOG-2635 CloudWatch forwarding rejecting large log events, fills tmpfs
- Closed