  1. Observability and Data Analysis Program
  2. OBSDA-115

Create alerting rules based on logs

      Problem Alignment

      The Problem

      Many customers still rely predominantly on logs to capture the data they need to identify problems quickly. Many issues can also be identified from metrics, but for some events, such as suspicious IP address activity in security or host errors in the runtime system, logs are the better source. OpenShift currently supports defining alerting rules and sending notifications based on metrics only. That leaves a big gap: users cannot detect, or be notified immediately about, events like the ones mentioned above.

      High-Level Approach

      As we move the Logging stack towards Loki (see OBSDA-7), we will be able to use its out-of-the-box capabilities to define alerting rules on logs using LogQL. That approach is very similar to Prometheus' alerting ecosystem and gives us the opportunity to reuse Prometheus' Alertmanager to distribute alerts and notifications. For customers, this means they do not need to configure notification channels twice, once for metrics and once for logs, but can reuse the same configuration.
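
As a rough illustration of what such a rule might look like, the sketch below builds a Prometheus-style rule group whose expression is a LogQL query and prints it as JSON (which is also valid YAML). The group name, alert name, label selector, threshold, and annotation text are made up for illustration; only the overall `groups` / `rules` / `alert` / `expr` / `for` layout follows the Prometheus rule-file convention.

```python
import json


def build_rule_group(namespace: str, threshold: float) -> dict:
    """Return a Prometheus-style rule group with a LogQL alert expression.

    All names and values here are illustrative assumptions.
    """
    # LogQL: per-second rate of log lines containing "error" over the last
    # 5 minutes, summed across all streams in the given namespace.
    expr = f'sum(rate({{namespace="{namespace}"}} |= "error" [5m])) > {threshold}'
    return {
        "groups": [
            {
                "name": "example-log-alerts",  # hypothetical group name
                "rules": [
                    {
                        "alert": "HighErrorLogRate",  # hypothetical alert name
                        "expr": expr,
                        "for": "10m",  # condition must hold this long before firing
                        "labels": {"severity": "warning"},
                        "annotations": {
                            "summary": f"High rate of error logs in {namespace}"
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_rule_group("my-app", 10), indent=2))
```

Because the alert carries ordinary labels such as `severity`, an Alertmanager route that already matches on those labels for metric alerts could pick it up without a second channel configuration.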

      For the configuration itself, we need to look into introducing a CRD (similar to the PrometheusRule CRD inside the Prometheus Operator) to allow users with non-admin permissions to configure the rules without changing the central Loki configuration.
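
One possible shape for this, sketched under assumptions: a namespaced custom resource carries the user's rule, and an operator translates it into a rule while forcing the namespace matcher, so a non-admin user cannot alert on logs from other namespaces. The `LogAlertRule` type and its fields are hypothetical and do not describe an existing API; the sketch also does not escape quotes in user input.

```python
from dataclasses import dataclass, field


@dataclass
class LogAlertRule:
    """Hypothetical namespaced custom resource; not an existing API."""
    name: str
    namespace: str           # namespace the resource was created in
    contains: str            # substring a log line must contain
    threshold: float         # fire when the count exceeds this value
    window: str = "5m"       # LogQL range for counting
    labels: dict = field(default_factory=dict)  # extra stream matchers


def to_rule(cr: LogAlertRule) -> dict:
    """Translate the resource into a single Prometheus-style alerting rule."""
    # The resource's own namespace always wins over any user-supplied
    # "namespace" matcher, which scopes the rule to that namespace.
    matchers = {**cr.labels, "namespace": cr.namespace}
    selector = ",".join(f'{k}="{v}"' for k, v in sorted(matchers.items()))
    expr = (
        f'sum(count_over_time({{{selector}}} |= "{cr.contains}" [{cr.window}]))'
        f" > {cr.threshold}"
    )
    return {"alert": cr.name, "expr": expr, "labels": {"namespace": cr.namespace}}


if __name__ == "__main__":
    cr = LogAlertRule(
        name="TooManyErrors",
        namespace="team-a",
        contains="error",
        threshold=100,
        labels={"app": "web", "namespace": "other"},  # override attempt is ignored
    )
    print(to_rule(cr)["expr"])
```

Keeping the translation in an operator means the central Loki configuration stays untouched; users only create or edit resources in namespaces they already have access to.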

      Goal & Success

      • Allow individual users to configure alerting rules based on patterns inside a log record.

      Solution Alignment

      Key Capabilities

      • As an Application SRE, I'd like to configure SLIs to get alerted when the number of messages that meet some criteria (e.g. errors) exceeds a particular threshold.
      • As an Application SRE, I'd like to configure where alerts will be sent so that I get notified on the right channels.
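
The first capability boils down to counting matching messages in a time window and comparing the count to a threshold. A minimal sketch of that check in plain Python, independent of any particular log backend (the function name, record shape, and sample data are assumptions for illustration):

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Tuple


def should_alert(
    records: Iterable[Tuple[datetime, str]],
    now: datetime,
    window: timedelta,
    predicate: Callable[[str], bool],
    threshold: int,
) -> bool:
    """Return True if more than `threshold` matching records fall in the window."""
    cutoff = now - window
    matching = sum(
        1 for ts, line in records if cutoff < ts <= now and predicate(line)
    )
    return matching > threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    records = [
        (now - timedelta(minutes=1), "level=error msg=timeout"),
        (now - timedelta(minutes=2), "level=info msg=ok"),
        (now - timedelta(minutes=3), "level=error msg=refused"),
        (now - timedelta(minutes=30), "level=error msg=old"),  # outside window
    ]
    is_error = lambda line: "level=error" in line
    print(should_alert(records, now, timedelta(minutes=5), is_error, threshold=1))
```

In the sample data two error lines fall inside the five-minute window, so a threshold of 1 is exceeded while a threshold of 2 is not.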

      Key Flows

      Open Questions & Key Decisions (optional)

      • Do we provide integration into Prometheus Alertmanager only and if so, how?
        • Note: We could integrate with the Alertmanager of our Monitoring stack automatically, but what happens if a customer decides to use an external Alertmanager and configures that inside Monitoring? I think we need to discuss this with the Monitoring team and identify whether we actually want a more centralized approach to alerting, as opposed to dividing it into Metrics and Logs, each with its own dedicated instance. I think that's another perfect use case for why Observatorium would be better to use in general in the future. It combines the Metrics and Logs stacks into one single deployment, so we could expose a single Alertmanager plus a configuration for pointing Prometheus and Loki to an external instance if necessary.

              jamparke@redhat.com Jamie Parker