OpenShift Logging / LOG-2635

CloudWatch forwarding rejects large log events, filling tmpfs


    • Release Note Text: Before this update, clusters configured to perform CloudWatch forwarding wrote rejected log files to temporary storage, causing cluster instability over time. With this update, chunk backup for CloudWatch has been disabled, resolving the issue.
    • Sprint: Logging (Core) - Sprint 219, Logging (Core) - Sprint 220
    • Severity: Critical

      On several clusters configured to perform CloudWatch forwarding, the following condition has been observed in collector containers:

      Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError) 
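
      For reference, the 262144-byte figure is the maximum size CloudWatch Logs accepts for a single log event, so a forwarder has to drop, truncate, or otherwise handle anything above it. Below is a minimal sketch of such a guard using boto3; the client setup, names, and drop-instead-of-backup policy are illustrative assumptions, not the collector's actual implementation.

      # Sketch only: mirrors the per-event limit behind the TooLargeEventError above.
      import time

      import boto3

      MAX_EVENT_SIZE = 262_144  # CloudWatch Logs per-event limit (256 KiB)

      client = boto3.client("logs", region_name="us-east-1")  # region is an example

      def forward(messages, group, stream):
          """Forward messages to CloudWatch Logs, dropping oversized events."""
          events = []
          for msg in messages:
              size = len(msg.encode("utf-8"))
              if size > MAX_EVENT_SIZE:
                  # The case the collector reports as TooLargeEventError;
                  # here the event is dropped rather than backed up to disk.
                  print(f"discarding {size}-byte event: exceeds {MAX_EVENT_SIZE}")
                  continue
              events.append({"timestamp": int(time.time() * 1000), "message": msg})
          if events:
              # Sequence-token handling and batch-size limits are omitted.
              client.put_log_events(logGroupName=group,
                                    logStreamName=stream,
                                    logEvents=events)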

      The rejected logs are written to tmpfs on the node running the collector pod:

      2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log 

      Over a sustained period, these files fill the available tmpfs space on the node, leading to memory exhaustion. When this happens on control plane nodes, it eventually destabilizes the cluster.
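
      To gauge how much temporary storage the rejected chunks are consuming on an affected node, a quick check along the lines of the sketch below can help; the path is taken from the warning above, and the rest is illustrative.

      # Sketch: sum the size of fluentd bad-chunk backups under tmpfs.
      import os

      BACKUP_DIR = "/tmp/fluent/backup"  # path from the fluentd warning above

      def backup_bytes(root=BACKUP_DIR):
          """Return the total size in bytes of backed-up chunk files under root."""
          total = 0
          for dirpath, _dirnames, filenames in os.walk(root):
              for name in filenames:
                  try:
                      total += os.path.getsize(os.path.join(dirpath, name))
                  except OSError:
                      # A chunk may be moved or deleted mid-walk; skip it.
                      continue
          return total

      if __name__ == "__main__":
          print(f"{backup_bytes() / (1024 * 1024):.1f} MiB of rejected chunks under {BACKUP_DIR}")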

      On the two clusters where we have observed this, it was the cluster audit logs that triggered the 'too large' warning.

      Creating this Jira on request per Slack thread [0], cc rhn-engineering-aconway 

      [0] https://coreos.slack.com/archives/CB3HXM2QK/p1652711924055009?thread_ts=1652660899.213729&cid=CB3HXM2QK

              Jeffrey Cantrill (jcantril@redhat.com)
              Matt Bargenquast (mbargenq, inactive)
              Anping Li
              Votes: 0
              Watchers: 14
