-
Task
-
Resolution: Done
-
Logging (LogExp) - Sprint 216, Logging (LogExp) - Sprint 217, Logging (LogExp) - Sprint 218, Logging (LogExp) - Sprint 219, Logging (LogExp) - Sprint 220, Log Storage - Sprint 221, Log Storage - Sprint 222, Log Storage - Sprint 223, Log Storage - Sprint 224, Log Storage - Sprint 225, Log Storage - Sprint 226, Log Storage - Sprint 227, Log Storage - Sprint 228, Log Storage - Sprint 229, Log Storage - Sprint 230, Log Storage - Sprint 231, Log Storage - Sprint 232, Log Storage - Sprint 233
There is an enhancement document with a dozen alerts:
https://github.com/openshift/enhancements/blob/master/enhancements/cluster-logging/loki-observability.md
Currently, only the first 2 alerts have been added:
- LokiRequestErrors
- LokiRequestPanics
The PR that added them is:
https://github.com/grafana/loki/pull/5345
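For context, the two rules are roughly of the following shape (a sketch in the loki-mixin style; the authoritative expressions and thresholds are the ones in the PR above, so treat the numbers here as illustrative):

```yaml
groups:
  - name: loki_alerts
    rules:
      - alert: LokiRequestErrors
        # Percentage of 5xx responses per job/route over the last 2 minutes.
        # The 10% threshold and the 15m "for" duration are illustrative.
        expr: |
          100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (namespace, job, route)
            /
          sum(rate(loki_request_duration_seconds_count[2m])) by (namespace, job, route)
            > 10
        for: 15m
        labels:
          severity: critical
      - alert: LokiRequestPanics
        # Any increase of the panic counter is worth alerting on immediately.
        expr: |
          sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
        labels:
          severity: critical
```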
Initially, the PR added all of them. However, because of our limited experience with real customers, we weren't sure whether all of them have real value or how to set their thresholds. In addition, the xxxHighLoad alerts require defining a threshold that depends on the LokiStack T-shirt size, which added complexity to the PR. Eventually we settled on these 2 alerts.
A few points summarizing the discussion in the PR:
- We received feedback from Grafana's Principal SWE suggesting that we remove the StorageSlow alerts (which are based on the boltdb_shipper latency metric) and look at the distributor latency instead. This probably makes sense in Grafana's setup, where they don't control their storage (?). However, we might want to keep them.
- Regarding the LokiReadPathHighLoad alert, he suggests implementing it using the cortex_query_frontend_queue_length metric, which should never stay high for a long period (see the first sketch after this list).
- The alerts LokiMemoryHigh and LokiCPUHigh won't fire since they are based on k8s resource limits that we don't set.
- For frequently used or computationally expensive queries, recording rules may be defined (a sketch follows below).
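To make the read-path suggestion concrete, a LokiReadPathHighLoad rule on top of cortex_query_frontend_queue_length could look like the sketch below. The alert name comes from the enhancement document; the queue-length threshold and the "for" duration are placeholders that would still need tuning, likely per T-shirt size:

```yaml
groups:
  - name: loki_read_path_alerts
    rules:
      - alert: LokiReadPathHighLoad
        # The query-frontend queue should drain quickly; a queue that stays
        # above ~10 items for 15 minutes suggests the read path is overloaded.
        # Both numbers are assumptions, not agreed values.
        expr: |
          sum(avg_over_time(cortex_query_frontend_queue_length[5m])) by (namespace, job) > 10
        for: 15m
        labels:
          severity: warning
```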
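As for the last point, a recording rule for the Loki ruler is an ordinary rule group whose expr is a LogQL metric query; the ruler evaluates it periodically and writes the result back as a metric that dashboards and alerts can read cheaply. The stream selector and rule name below are made up for illustration:

```yaml
groups:
  - name: expensive_queries
    interval: 1m
    rules:
      # Pre-compute the per-namespace error-line rate so it does not have to
      # be re-computed by every dashboard panel or alert evaluation.
      - record: namespace:loki_error_lines:rate5m
        expr: |
          sum by (namespace) (rate({namespace=~".+"} |= "error" [5m]))
```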