-
Task
-
Resolution: Done
-
Logging (LogExp) - Sprint 216, Logging (LogExp) - Sprint 217, Logging (LogExp) - Sprint 218, Logging (LogExp) - Sprint 219, Logging (LogExp) - Sprint 220, Log Storage - Sprint 221, Log Storage - Sprint 222, Log Storage - Sprint 223, Log Storage - Sprint 224, Log Storage - Sprint 225, Log Storage - Sprint 226, Log Storage - Sprint 227, Log Storage - Sprint 228, Log Storage - Sprint 229, Log Storage - Sprint 230, Log Storage - Sprint 231, Log Storage - Sprint 232, Log Storage - Sprint 233
There is an enhancement document with a dozen alerts:
https://github.com/openshift/enhancements/blob/master/enhancements/cluster-logging/loki-observability.md
Currently, only the first 2 alerts have been added:
- LokiRequestErrors
- LokiRequestPanics
The PR that added them is:
https://github.com/grafana/loki/pull/5345
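For context, the two rules are roughly of the following shape (a sketch in the loki-mixin style; the authoritative expressions and thresholds are the ones in the PR above, so treat the numbers here as illustrative):

```yaml
groups:
  - name: loki_alerts
    rules:
      - alert: LokiRequestErrors
        # Percentage of 5xx responses per job/route over the last 2 minutes.
        # The 10% threshold and the 15m "for" duration are illustrative.
        expr: |
          100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (namespace, job, route)
            /
          sum(rate(loki_request_duration_seconds_count[2m])) by (namespace, job, route)
            > 10
        for: 15m
        labels:
          severity: critical
      - alert: LokiRequestPanics
        # Any increase of the panic counter is worth alerting on immediately.
        expr: |
          sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
        labels:
          severity: critical
```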
Initially, the PR added all of them. However, because of our limited experience with real customers, we weren't sure whether all of them have real value or how to set their thresholds. In addition, the xxxHighLoad alerts require defining a threshold that depends on the LokiStack T-shirt size, which added complexity to the PR. Eventually we settled on these 2 alerts.
A few points summarizing the discussion in the PR:
- We received feedback from Grafana's Principal SWE suggesting that we remove the StorageSlow alerts (which are based on the boltdb_shipper latency metric) and look at the distributor latency instead. This probably makes sense in Grafana's setup, where they don't control their storage (?). However, we might want to keep them.
- Regarding the LokiReadPathHighLoad alert, he suggests implementing it using the cortex_query_frontend_queue_length metric, which should never stay high for a long period (see the first sketch after this list).
- The alerts LokiMemoryHigh and LokiCPUHigh won't fire since they are based on k8s resource limits that we don't set.
- For frequently used or computationally expensive queries, recording rules may be defined (a sketch follows below).
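To make the read-path suggestion concrete, a LokiReadPathHighLoad rule on top of cortex_query_frontend_queue_length could look like the sketch below. The alert name comes from the enhancement document; the queue-length threshold and the "for" duration are placeholders that would still need tuning, likely per T-shirt size:

```yaml
groups:
  - name: loki_read_path_alerts
    rules:
      - alert: LokiReadPathHighLoad
        # The query-frontend queue should drain quickly; a queue that stays
        # above ~10 items for 15 minutes suggests the read path is overloaded.
        # Both numbers are assumptions, not agreed values.
        expr: |
          sum(avg_over_time(cortex_query_frontend_queue_length[5m])) by (namespace, job) > 10
        for: 15m
        labels:
          severity: warning
```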
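As for the last point, a recording rule for the Loki ruler is an ordinary rule group whose expr is a LogQL metric query; the ruler evaluates it periodically and writes the result back as a metric that dashboards and alerts can read cheaply. The stream selector and rule name below are made up for illustration:

```yaml
groups:
  - name: expensive_queries
    interval: 1m
    rules:
      # Pre-compute the per-namespace error-line rate so it does not have to
      # be re-computed by every dashboard panel or alert evaluation.
      - record: namespace:loki_error_lines:rate5m
        expr: |
          sum by (namespace) (rate({namespace=~".+"} |= "error" [5m]))
```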