OpenShift Logging / LOG-1032

[spike] Metric for produced container logs and logs collected by the collector


Details

    • Logging (Core) - Sprint 197, Logging (Core) - Sprint 198, Logging (Core) - Sprint 199

    Description

      Story

      As an admin, I can get approximate data-flow metrics (bytes/second) for each forwarder input:

      • bytes_logged (counter): bytes written to log files by containers.
      • bytes_collected (counter): bytes read by the collector for forwarding.

      Log loss can be computed with a Prometheus query expression: (bytes_logged - bytes_collected)
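
      Since both metrics are counters, the bytes/second view implied above would normally come from rate() in Prometheus, e.g. rate(bytes_logged[5m]) - rate(bytes_collected[5m]) for the loss rate over a five-minute window (using the working metric names above; the final names and labels are still to be decided).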

      The metrics will be labeled so that the user can monitor total logged/collected bytes or break them down to individual log streams.

      TODO: define the labels to use.

      Note that we do not require 100% accuracy; the goal is to provide average trends so that SREs can identify problems or potential problems.
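
      A minimal sketch of the metric shape, assuming the Python prometheus_client library: the metric names are the working names from this story, the label set (namespace, pod, container) is only a placeholder until the label TODO above is resolved, and the real implementation would live inside fluentd rather than in a standalone process.

      # Sketch only: names, labels and port are placeholders, not the final design.
      import time
      from prometheus_client import Counter, start_http_server

      LABELS = ["namespace", "pod", "container"]   # placeholder label set
      BYTES_LOGGED = Counter(
          "bytes_logged", "Bytes written to log files by containers", LABELS)
      BYTES_COLLECTED = Counter(
          "bytes_collected", "Bytes read by the collector for forwarding", LABELS)

      def record_logged(namespace, pod, container, nbytes):
          # Called whenever the watcher estimates newly written bytes for a stream.
          BYTES_LOGGED.labels(namespace, pod, container).inc(nbytes)

      def record_collected(namespace, pod, container, nbytes):
          # Called whenever the collector reads bytes from a stream for forwarding.
          BYTES_COLLECTED.labels(namespace, pod, container).inc(nbytes)

      if __name__ == "__main__":
          start_http_server(24231)                 # /metrics endpoint for Prometheus to scrape
          record_logged("test-ns", "test-pod", "app", 4096)
          record_collected("test-ns", "test-pod", "app", 3072)
          time.sleep(300)                          # keep serving so the endpoint can be scraped

      Prometheus can then scrape the endpoint, and the loss expression above applies either per label combination or summed across all streams.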

      Acceptance Criteria

      • Metric data published via the fluentd HTTP endpoint in a form consumable by Prometheus
      • Logged and collected metrics available for each container UUID
      • Tests for worst-case accuracy under heavy load (where large log loss is occurring); see the sketch below. Goal: 95% accurate?
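
      One possible shape for the accuracy check, as a hedged sketch: a test writes a known volume of log-like data (standing in for a busy container), and the bytes_logged value reported for that stream is compared against the known total. How the reported value is read back (a Prometheus query or a scrape of the fluentd endpoint) is left open; the dummy value below only keeps the sketch self-contained.

      # Sketch of the accuracy computation for the heavy-load test.
      def write_known_volume(path, total_bytes, line=b"x" * 1023 + b"\n"):
          """Write roughly total_bytes of log-like data and return the exact byte count."""
          written = 0
          with open(path, "wb") as f:
              while written < total_bytes:
                  f.write(line)
                  written += len(line)
          return written

      def accuracy(reported_bytes, actual_bytes):
          """Fraction of the actually written bytes that the metric accounted for."""
          return reported_bytes / actual_bytes if actual_bytes else 1.0

      if __name__ == "__main__":
          actual = write_known_volume("/tmp/test-container.log", 10 * 1024 * 1024)
          # In the real test this would be the bytes_logged value read back for the stream;
          # a dummy stands in so the sketch runs on its own.
          reported = int(actual * 0.97)
          print("accuracy:", accuracy(reported, actual))
          assert accuracy(reported, actual) >= 0.95, "below the 95% accuracy goal"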

      Notes

      The logged metric is estimated from the total size of files rotated away, as observed by fluentd's inotify watchers. Ideally this metric would be produced by CRI-O itself. It could also be estimated by an independent watcher process. Fluentd was chosen as the quickest path to a useful result because:

      • It already performs the inotify and stat calls; we just piggyback on them to make the additional calculations.
      • It already has an HTTP server for publishing metrics, which is linked to Prometheus.

      This method is not perfect: under heavy load it is possible that fluentd could miss some inotify events. However, file rotations happen infrequently compared to file reads, so under most conditions we should get a reasonable estimate of loss. If necessary, we can add heuristics based on file timestamps to estimate missed notifications.
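
      For illustration only, a standalone watcher could produce the same kind of estimate as in the sketch below: track the size and inode of one container log file, count normal growth, and on rotation (inode change) count whatever the new file already contains. It polls with stat rather than using inotify, and the path is hypothetical; fluentd gets the equivalent information from its existing inotify/stat calls. Bytes written to the old file between the last observation and the rotation are the part that can be missed, which is the inaccuracy described above.

      # Stdlib-only sketch of the bytes_logged estimate for a single log file.
      import os
      import time

      def watch(path, interval=1.0):
          """Yield estimated byte increments written to `path`, across rotations."""
          last_inode, last_size = None, 0
          while True:
              try:
                  st = os.stat(path)
              except FileNotFoundError:
                  time.sleep(interval)
                  continue
              if last_inode is None:
                  last_inode, last_size = st.st_ino, st.st_size   # start counting from here
              elif st.st_ino != last_inode:
                  # Rotation: the old file is gone (its growth since the last poll is missed);
                  # whatever the new file already holds counts as newly logged bytes.
                  yield st.st_size
                  last_inode, last_size = st.st_ino, st.st_size
              elif st.st_size > last_size:
                  yield st.st_size - last_size                     # normal growth of the current file
                  last_size = st.st_size
              time.sleep(interval)

      if __name__ == "__main__":
          bytes_logged = 0
          for increment in watch("/var/log/containers/example.log"):  # hypothetical path
              bytes_logged += increment
              print("estimated bytes_logged:", bytes_logged)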

      Open Questions

      Define the labels attached to these metrics.

      We need to review the overlap between logging metadata and the Observatorium labels used for other cluster metrics.

