Type: Feature
Resolution: Done
Priority: Normal
Fix Version/s: Logging 5.7, Logging 5.8
Affects Version/s: None
Component/s: Log Storage, PM Logging
Labels:
- no_epic
- pm_ack+
- pm_old

Blocked:
False
Ready:
False
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Release Note Text:
Undefined

Problem Alignment

The Problem

Many customers still rely predominantly on logs as their main source of data for quickly identifying problems. Many issues can also be identified from metrics, but for some events, such as suspicious IP address activity in security or runtime system issues such as host errors, logs are the better source. OpenShift currently only supports defining alerting rules and getting notifications based on metrics. That leaves a big gap in identifying these events and being notified about them immediately.

High-Level Approach

As we move the Logging stack towards using Loki (see OBSDA-7), we will be able to use its out-of-the-box capabilities to define alerting rules on logs using LogQL. That approach is very similar to Prometheus' alerting ecosystem and gives us the opportunity to reuse Prometheus' Alertmanager to distribute alerts and notifications. For customers, this means they do not need to configure notification channels twice, once for metrics and once for logs, but can reuse the same configuration.

For the configuration itself, we need to look into introducing a CRD (similar to the PrometheusRule CRD inside the Prometheus Operator) to allow users with non-admin permissions to configure the rules without changing the central Loki configuration.
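To make the idea of a per-namespace rule resource more concrete, here is a minimal sketch of what such a manifest might contain, built as a plain Python dictionary. The `apiVersion`, `kind`, and field layout are hypothetical and only mirror the general shape of a PrometheusRule; the ticket does not define the actual CRD. The LogQL expression and the `my-app` namespace are illustrative assumptions as well.

```python
import json


def build_log_alert_rule(namespace, name, alert, expr, for_duration, severity):
    """Build a hypothetical namespaced log-alerting rule manifest.

    The apiVersion and kind below are placeholders for illustration,
    not an existing API.
    """
    return {
        "apiVersion": "logging.example.com/v1alpha1",  # hypothetical group/version
        "kind": "LogAlertingRule",  # hypothetical kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": f"{name}-group",
                    "rules": [
                        {
                            "alert": alert,
                            "expr": expr,  # LogQL metric query with a threshold
                            "for": for_duration,
                            "labels": {"severity": severity},
                        }
                    ],
                }
            ]
        },
    }


manifest = build_log_alert_rule(
    namespace="my-app",
    name="high-error-rate",
    alert="HighErrorRate",
    # Fires when lines containing "error" arrive faster than 10 per second.
    expr='sum(rate({namespace="my-app"} |= "error" [5m])) > 10',
    for_duration="10m",
    severity="warning",
)
print(json.dumps(manifest, indent=2))
```

Because the resource would live in the user's own namespace, a non-admin user could create or change it without touching the central Loki configuration, which is the property the proposal is after.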

Goal & Success

Allow individual users to configure alerting rules based on patterns inside a log record.

Solution Alignment

Key Capabilities

As an Application SRE, I'd like to configure SLIs to get alerted when the number of messages that meet some criteria (e.g. errors) exceeds a particular threshold.
As an Application SRE, I'd like to configure where alerts are sent so that I get notified on the right channels.
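The first capability amounts to counting matching messages in a time window and comparing the count against a threshold. The sketch below illustrates that logic in plain Python; it is not how Loki evaluates rules, and `LogRecord`, `should_alert`, and the sample data are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime
    message: str


def should_alert(
    records: Iterable[LogRecord],
    pattern: str,
    window: timedelta,
    threshold: int,
    now: datetime,
) -> bool:
    """Return True when more than `threshold` records containing `pattern`
    fall inside the interval [now - window, now]."""
    cutoff = now - window
    matches = sum(
        1
        for record in records
        if cutoff <= record.timestamp <= now and pattern in record.message
    )
    return matches > threshold


now = datetime(2021, 8, 6, 9, 0, 0)
records = [
    LogRecord(now - timedelta(minutes=1), "error: connection refused"),
    LogRecord(now - timedelta(minutes=2), "error: timeout"),
    LogRecord(now - timedelta(minutes=3), "request served"),
    LogRecord(now - timedelta(minutes=30), "error: stale entry outside window"),
]

# Two matching records in the last five minutes exceed a threshold of 1.
print(should_alert(records, "error", timedelta(minutes=5), threshold=1, now=now))
```

In the proposed design this evaluation would be expressed as a LogQL query and run by Loki, with the resulting alert handed to Alertmanager for routing to the configured channels.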

Key Flows

Open Questions & Key Decisions (optional)

Do we provide integration into Prometheus Alertmanager only and if so, how?
- Note: We could integrate into our Monitoring's Alertmanager automatically, but what happens if a customer decides to use an external Alertmanager and configures that inside Monitoring? I think we need to discuss this with the Monitoring team and identify whether we actually want a more centralized approach to Alerting, as opposed to dividing it into Metrics and Logs, each with its own dedicated instance. I think that's another perfect use case for why Observatorium would be better to use in general in the future. It combines the Metrics and Log stacks into a single deployment, so we could expose a single Alertmanager plus a configuration for pointing Prometheus and Loki to an external instance if necessary.

Assignee: Jamie Parker

Reporter: Jamie Parker

Votes: 2

Watchers: 3

Created: 2021/08/06 8:49 AM

Updated: 2024/12/20 11:12 PM

Resolved: 2024/01/25 6:27 PM
