OpenShift Logging / LOG-2132

Correlation service for observability data


    • Type: Epic
    • Resolution: Done
    • Priority: Major
    • Correlation Engine
    • 0% To Do, 0% In Progress, 100% Done
    • Release Note Not Required

      Goals

      A cluster service with a REST API that provides correlated data to clients that need it (console components or CLI tools).

      • Accepts correlation context as query
      • Applies correlation rules to form queries to multiple back-ends: signal stores and cluster APIs.
      • Returns the queries to the caller; the caller executes them.
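      The flow above could be sketched as follows. This is a hypothetical illustration, not a committed design: the function name `correlate`, the context fields, and the label names in the generated queries are all assumptions, and the real service would expose this as a REST endpoint.

```python
# Hypothetical sketch: the service accepts a correlation context, applies
# rules to build a native query per back-end, and returns the queries for
# the caller to execute. All names here are illustrative.

def correlate(context):
    """Build native queries for each back-end from a correlation context.

    `context` is a dict such as:
      {"namespace": "openshift-logging",
       "pod": "collector-abc",
       "window": ("2022-01-01T00:00:00Z", "2022-01-01T01:00:00Z")}
    """
    ns, pod = context["namespace"], context["pod"]
    start, end = context["window"]
    return {
        # LogQL-style selector for Loki (label names assumed).
        "loki": f'{{kubernetes_namespace_name="{ns}", kubernetes_pod_name="{pod}"}}',
        # PromQL-style selector for Prometheus.
        "prometheus": f'{{namespace="{ns}", pod="{pod}"}}',
        # Kubernetes API path: events for the pod in the namespace.
        "k8s_events": f"events?fieldSelector=involvedObject.name={pod}&namespace={ns}",
        # Time window returned unchanged; the caller applies it per store.
        "window": (start, end),
    }
```

      Note that the service only builds queries; executing them stays with the caller, which keeps credentials and back-end access on the client side.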

      Correlation API

      The API must:

      • Define a correlation context, including:
        • time window
        • related resources
        • alert and metric types
        • desired result type(s) - signals, resources
        • should be extensible to handle additional context data types
      • Define result format(s).
        • Results must use streams and/or iterators for high-volume data (logs, traces)
        • See open questions on result format.
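      One possible shape for the context described above, as a sketch; the field names are illustrative assumptions, not the API:

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Tuple

# Hypothetical shape for a correlation context; field names are
# illustrative, not a committed API.

@dataclass
class CorrelationContext:
    window: Optional[Tuple[str, str]] = None          # (start, end) timestamps
    resources: list = field(default_factory=list)     # related k8s resources
    alerts: list = field(default_factory=list)        # alert and metric types
    result_types: list = field(default_factory=list)  # e.g. "signals", "resources"
    extra: dict = field(default_factory=dict)         # extensibility for new context types

def stream_results(rows) -> Iterator[dict]:
    """Yield results one at a time so high-volume signals (logs, traces)
    are never held in memory all at once."""
    for row in rows:
        yield row
```

      The `extra` dict is one way to satisfy the extensibility requirement without versioning the whole context type for every new context field.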

      Meta-data dictionaries

      A uniform data model may be a long-term goal, but currently we have (at least) these distinct data models:

      • OpenTelemetry - used by tracing; a candidate for the eventual unified model.
      • ViaQ - used by logging.
      • OpenShift/K8s/Prometheus - used by alerts/metrics. There is no formal spec I'm aware of, but there are strong conventions for naming metrics and metric labels in Kubernetes and OpenShift, based on the Prometheus style.

      We need dictionaries to translate between these data models so we can

      • formulate native queries for each back-end.
      • normalize results returned from each back-end.
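      A dictionary in this sense could be as simple as a field-name mapping per source model. The entries below are examples of the kind of mapping needed (the ViaQ and OTel names shown are assumptions for illustration, not a complete or authoritative table):

```python
# Hypothetical field-name dictionaries translating each source model into
# a single target model. Mappings shown are illustrative examples only.

VIAQ_TO_OTEL = {
    "kubernetes.namespace_name": "k8s.namespace.name",
    "kubernetes.pod_name": "k8s.pod.name",
}
PROM_TO_OTEL = {
    "namespace": "k8s.namespace.name",
    "pod": "k8s.pod.name",
}

def normalize(record, dictionary):
    """Rename a record's keys into the target model. Unknown keys pass
    through unchanged so no data is silently dropped."""
    return {dictionary.get(k, k): v for k, v in record.items()}
```

      The same tables can be inverted to go the other way when formulating a native query from normalized context fields.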

      Correlation Engine Implementation

      The engine should start small but be able to grow by adding:

      • New back-ends (initially Loki, Prometheus, Jaeger, K8s events and resources)
      • New correlation rules; ways to handle specific types of context more accurately.
      • New data model dictionaries.

      Extensions may be static code, plug-ins, declarative data or some combination.

      Whatever the form, the engine must have clear extension points.
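      As one sketch of such an extension point, a registry of per-back-end query builders lets new back-ends be added without touching the engine core. The names below are assumptions for illustration; as noted above, extensions could equally be plug-ins or declarative data:

```python
# Hypothetical extension point: a registry of back-end query builders.
# New back-ends register themselves; the engine core never changes.

BACKENDS = {}

def register_backend(name):
    """Decorator that registers a query builder under a back-end name."""
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("loki")
def loki_query(context):
    # Illustrative LogQL-style selector; label name is an assumption.
    return f'{{kubernetes_namespace_name="{context["namespace"]}"}}'

def build_queries(context):
    """Apply every registered back-end's builder to the context."""
    return {name: build(context) for name, build in BACKENDS.items()}
```

      Correlation rules and data-model dictionaries could be registered through analogous hooks, keeping all three growth axes listed above open.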

      Non-Goals

      See OBSDA-110

      Motivation

      See OBSDA-110

      Alternatives

      Rely on 3rd-party tools to meet our customers' correlation needs.

      Acceptance Criteria

      This Epic is complete when we have a correlation engine as described above that is sufficient to be released as GA.

      The engine will be built so that we can have multiple "checkpoint" releases, either internally or for customer preview.

      We will refine our idea of what "GA ready" means by experimenting and getting feedback on those checkpoints.

      Risk and Assumptions

      Risk: The scope is ambitious; we risk getting bogged down.

      Remedy: Start small and grow.

      • Small minimum feature set for initial preview, grow incrementally.
      • GA when preview feedback indicates we have "enough".
      • Keep growing in following releases.

      Risk: Correlation results are not satisfactory.

      Remedy: Early investigation suggests we can produce valuable correlations, but there is no way to be sure until we try.

      Risk: Can't compete/keep up with 3rd party tools on advanced features.

      Remedy: See LOG-1779 Value Proposition

      Documentation Considerations

      Documentation of (or self-documenting) console components for console-based use.

      Documentation of query language, result formats etc. for CLI-based use if we decide to support that.

      Open Questions

      Should results be:

      • normalized to a single consistent format and data model? (OpenTelemetry?)
      • returned in native form for each signal type?
      • both, based on user preference?
        • New users are likely to prefer consistent output.
        • Existing users have their own tools/queries based on the existing data models.
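      The "both" option above could be a simple flag on the result path. A minimal sketch, assuming a preference flag and the illustrative field mapping (neither is decided):

```python
# Sketch of the "both" option: native records by default, normalized
# records on request. The field mapping is illustrative only.

PROM_TO_OTEL = {"namespace": "k8s.namespace.name", "pod": "k8s.pod.name"}

def results(records, normalized=False):
    """Return records in native form, or renamed into the unified model
    when the caller asks for normalized output."""
    if not normalized:
        return records  # native form works with users' existing tools
    return [{PROM_TO_OTEL.get(k, k): v for k, v in r.items()} for r in records]
```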

      Additional Notes

              Assignee / Reporter: Alan Conway (rhn-engineering-aconway)