-
Task
-
Resolution: Done
-
Undefined
-
None
-
3
-
False
-
False
-
NEW
-
OBSDA-7 - Adopting Loki as an alternative to Elasticsearch to support more lightweight, easier to manage/operate storage scenarios
-
VERIFIED
-
-
Logging (LogExp) - Sprint 211, Logging (LogExp) - Sprint 212, Logging (LogExp) - Sprint 214, Logging (LogExp) - Sprint 215
As an OpenShift administrator, I want to receive alerts for upcoming or present unhealthy conditions of the Loki cluster so that I can proactively take countermeasures for recovery.
Acceptance criteria
- Define a set of alerts per component for the most common unhealthy conditions of a Loki cluster.
- Each alert is registered via a PrometheusRule in OpenShift cluster monitoring.
- Alerts fire in the OpenShift cluster monitoring Alertmanager when their trigger conditions are active.
Notes
- Investigate upstream for Grafana-provided alerts on Loki cluster health.
- Investigate if any alerts require custom recording rules to execute complex aggregations.
- Compile a document listing the name, description, and purpose of each alert, as well as the recommended thresholds for activation.
- Enhance the document with recommendations on how to aggregate the alerts per path where possible, e.g. ingestion alerts vs. querying alerts.
- Provide the final list of alerts as a static reconcilable PrometheusRule custom resource per LokiStack instance in the Loki-Operator.
- Provide Prometheus rule unit tests for the final set of alerts (for an example of how to test alerts/rules, see https://github.com/openshift/elasticsearch-operator/blob/master/test/files/prometheus-unit-tests/test.yml).
- Initial investigation work on what alert types we want to have is here: https://docs.google.com/document/d/1-hJ8l-sQPVBcdCNXUsIF-0FCNbf1F18O/edit
- The enhancement proposal document is available here: https://github.com/openshift/enhancements/blob/master/enhancements/cluster-logging/loki-observability.md
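To illustrate the notes on grouping alerts per path and shipping them as one static PrometheusRule per LokiStack instance, here is a minimal Python sketch. It is not the Loki-Operator implementation. The alert names, metric names, thresholds, stack name, and namespace are all hypothetical placeholders; only the overall PrometheusRule layout (`monitoring.coreos.com/v1`, `spec.groups[].rules[]`) follows the Prometheus Operator resource format.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str      # hypothetical alert name
    path: str      # aggregation path, e.g. "ingestion" or "querying"
    expr: str      # placeholder PromQL expression, not a vetted threshold
    duration: str  # how long the condition must hold before the alert fires
    severity: str
    summary: str


# Hypothetical alerts; the metric names are placeholders, not real Loki metrics.
EXAMPLE_ALERTS = [
    Alert("ExampleLokiQueryLatencyHigh", "querying",
          "example_loki_query_p99_seconds > 10", "15m", "warning",
          "Query latency is above the assumed threshold."),
    Alert("ExampleLokiPushErrorsHigh", "ingestion",
          "example_loki_push_error_ratio > 0.10", "15m", "critical",
          "More than 10% of push requests are failing."),
]


def build_prometheus_rule(stack_name, namespace, alerts):
    """Build a PrometheusRule manifest (as a dict) with one rule group per path."""
    rules_by_path = defaultdict(list)
    for alert in alerts:
        rules_by_path[alert.path].append({
            "alert": alert.name,
            "expr": alert.expr,
            "for": alert.duration,
            "labels": {"severity": alert.severity},
            "annotations": {"summary": alert.summary},
        })
    # Sort groups by path so repeated builds yield identical output,
    # which keeps a reconcile loop from seeing spurious differences.
    groups = [
        {"name": f"{stack_name}-{path}", "rules": rules}
        for path, rules in sorted(rules_by_path.items())
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{stack_name}-alerts", "namespace": namespace},
        "spec": {"groups": groups},
    }


if __name__ == "__main__":
    # "lokistack-dev" and "openshift-logging" are assumed example values.
    manifest = build_prometheus_rule("lokistack-dev", "openshift-logging", EXAMPLE_ALERTS)
    print(json.dumps(manifest, indent=2))
```

The sketch emits JSON because it needs only the standard library. Keeping one rule group per path means ingestion and querying alerts stay in separate groups of the same resource.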