Status: In Progress
A cluster service with a REST API that provides correlated data to clients such as console components or CLI tools.
- Accepts a correlation context as the query.
- Applies correlation rules to form queries to multiple back-ends: signal stores and cluster APIs.
- Returns the queries to the caller, which executes them.
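The flow above can be sketched roughly as follows. This is only an illustration, not the engine's design: `Query`, `Rule`, `correlate` and the two rule functions are hypothetical names, and the label and metric names in the query strings are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Query:
    backend: str  # hypothetical back-end name, e.g. "loki" or "prometheus"
    text: str     # query in that back-end's native query language


Context = Dict[str, str]
# A rule inspects the context and returns zero or more back-end queries.
Rule = Callable[[Context], List[Query]]


def pod_logs_rule(ctx: Context) -> List[Query]:
    """Form a log query for the pod named in the context (label names assumed)."""
    if "namespace" not in ctx or "pod" not in ctx:
        return []
    ns, pod = ctx["namespace"], ctx["pod"]
    selector = f'{{kubernetes_namespace_name="{ns}",kubernetes_pod_name="{pod}"}}'
    return [Query("loki", selector)]


def pod_metrics_rule(ctx: Context) -> List[Query]:
    """Form a metric query for the same pod (metric name assumed)."""
    if "namespace" not in ctx or "pod" not in ctx:
        return []
    ns, pod = ctx["namespace"], ctx["pod"]
    return [Query("prometheus", f'kube_pod_status_phase{{namespace="{ns}",pod="{pod}"}}')]


def correlate(ctx: Context, rules: List[Rule]) -> List[Query]:
    """Apply every rule to the context; the caller executes the returned queries."""
    queries: List[Query] = []
    for rule in rules:
        queries.extend(rule(ctx))
    return queries
```

A caller would pass a context such as `{"namespace": "shop", "pod": "cart-1"}` and run each returned query against the named back-end itself.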
The API must:
- Define a correlation context, including:
  - time window
  - related resources
  - alert and metric types
  - desired result type(s): signals, resources
  - should be extensible to handle additional context data types
- Define result format(s).
  - Results must use streams and/or iterators for high-volume results (logs, traces).
  - See open questions on result format.
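As a minimal sketch of these two requirements, assuming hypothetical names throughout (`CorrelationContext`, its fields, and `stream_results` are not part of any agreed API), the context could be a small record with an open-ended `extra` field, and high-volume results could be exposed as a lazy iterator over pages:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class CorrelationContext:
    start: datetime                                           # time window start
    end: datetime                                             # time window end
    resources: List[str] = field(default_factory=list)        # related resources
    signal_types: List[str] = field(default_factory=list)     # alert and metric types
    result_types: List[str] = field(default_factory=lambda: ["signals"])
    extra: Dict[str, Any] = field(default_factory=dict)       # additional context data types


def stream_results(pages: Iterable[List[dict]]) -> Iterator[dict]:
    """Yield records one at a time so a large result is never held in memory at once."""
    for page in pages:
        yield from page
```

Because `stream_results` is a generator, a client can stop reading early (for example after the first matching log line) without the remaining pages being fetched.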
A uniform data model may be a long-term goal, but currently we have (at least) these distinct data models:
- OpenTelemetry - used by tracing, candidate for an eventual unified model.
- ViaQ - used by logging.
- OpenShift/K8s/Prometheus - used by alerts and metrics. There is no formal spec that I'm aware of, but there are strong conventions, based on the Prometheus style, for naming metrics and metric labels in K8s and OpenShift.
We need dictionaries to translate between these data models so we can:
- formulate native queries for each back-end.
- normalize results returned from each back-end.
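One possible shape for such a dictionary, shown as a sketch only: a table from a canonical name to each model's native name, used in one direction to build queries and in the other to normalize results. The function names are hypothetical, and the three attribute mappings are illustrative assumptions rather than a complete or authoritative mapping.

```python
from typing import Dict

# canonical (OpenTelemetry-style) attribute name -> native name, per data model
DICTIONARIES: Dict[str, Dict[str, str]] = {
    "viaq": {
        "k8s.namespace.name": "kubernetes.namespace_name",
        "k8s.pod.name": "kubernetes.pod_name",
    },
    "prometheus": {
        "k8s.namespace.name": "namespace",
        "k8s.pod.name": "pod",
    },
}


def to_native(model: str, attrs: Dict[str, str]) -> Dict[str, str]:
    """Translate canonical attributes to a model's native names for query building.

    Attributes the model has no name for are dropped.
    """
    names = DICTIONARIES[model]
    return {names[k]: v for k, v in attrs.items() if k in names}


def normalize(model: str, record: Dict[str, str]) -> Dict[str, str]:
    """Rename a native record's known fields to canonical names; keep the rest as-is."""
    reverse = {native: canon for canon, native in DICTIONARIES[model].items()}
    return {reverse.get(k, k): v for k, v in record.items()}
```

Keeping the mapping as plain data rather than code is one way to let new dictionaries be added declaratively.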
The engine should start small but be able to grow by adding:
- New back-ends (initially Loki, Prometheus, Jaeger, K8s events and resources)
- New correlation rules: ways to handle specific types of context more accurately.
- New data model dictionaries.
Extensions may be static code, plug-ins, declarative data or some combination.
Whatever the form, the engine must have clear extension points.
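To make "clear extension points" concrete, here is a minimal sketch of one option, a registry that back-ends plug into through a small interface. `Backend`, `Engine`, `register_backend` and `PrometheusBackend` are hypothetical names, and the query the example back-end builds is only illustrative.

```python
from typing import Dict, Protocol


class Backend(Protocol):
    """Extension point: anything with a name and a query builder can be registered."""

    name: str

    def build_query(self, context: Dict[str, str]) -> str: ...


class Engine:
    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register_backend(self, backend: Backend) -> None:
        """Add a back-end; reject duplicate names so extensions cannot shadow each other."""
        if backend.name in self._backends:
            raise ValueError(f"back-end already registered: {backend.name}")
        self._backends[backend.name] = backend

    def queries(self, context: Dict[str, str]) -> Dict[str, str]:
        """Ask every registered back-end for its native query for this context."""
        return {name: b.build_query(context) for name, b in self._backends.items()}


class PrometheusBackend:
    name = "prometheus"

    def build_query(self, context: Dict[str, str]) -> str:
        # Illustrative only: select alerts in the context's namespace.
        return f'ALERTS{{namespace="{context["namespace"]}"}}'
```

The same registration pattern could apply to correlation rules and data model dictionaries, whether they are supplied as static code, plug-ins or declarative data.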
Rely on 3rd party tools to develop to meet our customers' correlation needs.
This Epic is complete when we have a correlation engine as described above that is sufficient to be released as GA.
The engine will be built so that we can have multiple "checkpoint" releases, either internally or for customer preview.
We will refine our idea of what "GA ready" means by experimenting and getting feedback on those checkpoints.
Risk: Ambitious, risks getting bogged down.
Remedy: Start small and grow.
- Small minimum feature set for initial preview, grow incrementally.
- GA when preview feedback indicates we have "enough".
- Keep growing in following releases.
Risk: Correlation results are not satisfactory.
Remedy: Early investigation suggests we can produce valuable correlations.
There is no way to be sure until we try it.
Risk: Can't compete/keep up with 3rd party tools on advanced features.
Remedy: See LOG-1779 Value Proposition
Documentation of (or self-documenting) console components for console-based use.
Documentation of query language, result formats, etc. for CLI-based use, if we decide to support that.
Should results be:
- normalized to a single consistent format and data model? (OpenTelemetry?)
- returned in native form for each signal type?
- either of the above, based on user preference?
  - New users are likely to prefer consistent output.
  - Existing users have their own tools/queries based on existing data models.