Type: Feature
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: Correlation, Data Analytics, Data Visualization, PM Analytics, PM Obs-UI
Labels:
- korrel8r
- pm_ack+

Blocked:
False
Ready:
False

Use cases

As an SRE receiving an alert, I want to:

Quickly navigate to relevant run-books, logs, metrics, traces, events and resource status to understand the problem.
Easily navigate between all these different views to gather information when:
- the runbook solution does not apply or does not work, so I have to dig deeper.
- I want to verify the real situation matches the runbook's assumptions before applying it.
Quickly jump to relevant resource specs to take necessary actions (for example, killing stuck pods so they can restart).

As an SRE with a container/pod/service that is known to be in trouble, I want to:

Same as previous use-case but working from a resource as the initial "correlation context" rather than an alert.

These use-cases are the initial focus; there are many other potential uses of signal and resource correlation.

Overview

Correlation means allowing the user to retrieve relevant signal and resource data across all types of signal given an initial correlation context such as an alert or a resource.

Signals

A signal is data periodically emitted or collected from cluster components, providing a history of events or data-points over time.

The traditional observable signals in a cluster include:

Logs (structured or unstructured text records emitted by containers)
Metrics (numeric values collected periodically)
Alerts (structured records indicating an important transition in metric values)
Traces (tree-structured records of function calls or network requests, traceable across multiple containers)

We also consider these to be signals:

K8s Events - effectively these are structured logs stored as API objects instead of log file records.
Network Flow Events (coming soon, records of network-level events)

The following objects are not signals (don't provide a history of values) but can be correlated with signals:

Resources - references to pods, containers, persistentvolumes etc. are the links between correlated data
Runbooks - contain identifiable actions (e.g. oc delete pod suchandsuch) that provide:
- links to relevant resources
- actions on those resources that can be automated as a single click, instead of commands to be copied to a terminal.
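
As a rough illustration of what "identifiable actions" could mean, the sketch below scans runbook text for simple command lines of the form "oc <verb> <kind> <name>" and returns them as (verb, kind, name) tuples that a UI could turn into resource links or one-click actions. The function name, the regular expression and the sample runbook text are assumptions made for this example; real runbooks will need more robust parsing (flags, namespaces, multiple resources).

```python
import re

# Assumption: actions appear on their own line as "oc <verb> <kind> <name>"
# (or the kubectl equivalent), optionally preceded by a "$" shell prompt.
ACTION_PATTERN = re.compile(
    r"^[ \t]*(?:\$[ \t]+)?(?:oc|kubectl)[ \t]+([a-z]+)[ \t]+([a-z]+)[ \t]+([a-z0-9.-]+)",
    re.MULTILINE,
)

def extract_actions(runbook_text):
    """Return a list of (verb, kind, name) tuples found in the runbook text."""
    return ACTION_PATTERN.findall(runbook_text)

# Hypothetical runbook fragment, using the command from the example above.
runbook = """If the pod is stuck, restart it:
    oc delete pod suchandsuch
"""
actions = extract_actions(runbook)
```

Each tuple names both the resource to link to (kind and name) and the action to offer (verb).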

Types of correlation

Most correlations will involve selecting signals in a restricted time window.

Alerts refer to resources. Signals carry metadata such as origin pod or container, k8s labels etc., so a resource can be linked to signals by using its attributes to query for signals with matching metadata.
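
A minimal sketch of this kind of lookup, assuming a label-selector style log query and a time window around the moment the alert fired. The label names in LABEL_FOR_ATTRIBUTE, the function names and the sample alert are assumptions made for illustration; the actual label names depend on how the log store is configured.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from resource attributes to signal metadata label names.
LABEL_FOR_ATTRIBUTE = {
    "namespace": "kubernetes_namespace_name",
    "pod": "kubernetes_pod_name",
    "container": "kubernetes_container_name",
}

def log_selector(resource):
    """Build a label selector matching logs that originate from the resource."""
    pairs = [
        f'{LABEL_FOR_ATTRIBUTE[attr]}="{value}"'
        for attr, value in resource.items()
        if attr in LABEL_FOR_ATTRIBUTE
    ]
    return "{" + ", ".join(pairs) + "}"

def time_window(fired_at, before, after):
    """Return the (start, end) window around the moment the alert fired."""
    return fired_at - before, fired_at + after

# Hypothetical alert that refers to a pod.
alert = {
    "fired_at": datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc),
    "resource": {"namespace": "shop", "pod": "cart-0"},
}
selector = log_selector(alert["resource"])
start, end = time_window(alert["fired_at"], timedelta(minutes=15), timedelta(minutes=5))
```

The selector and the window together restrict the query to logs from the alerting resource around the time of the alert.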

An important part of this feature will be to create a database of interesting resources and relationships (e.g. pod contains container, persistentvolume is-mounted-by container) and to use those to construct correlation rules, for example:

If an alert (or its runbook) refers to a workload resource, then logs/traces from the most specific workload resource (deployment > pod > container) are interesting
If an alert/runbook refers to a persistentvolume, then k8s events referring to that PV and logs/traces from containers that mount that PV are interesting
If a context resource is referred to by a metric or alerting rule, then the most specific (best match for resource name < namespace < type etc.) metric/alert histories are interesting
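
The first two rules can be sketched over a tiny in-memory relationship database. This is one possible reading of the rules, not a design: the triples, resource names, the SPECIFICITY ranking and the function names are all made up for the example.

```python
# Hypothetical relationship database: (subject, relation, object) triples.
RELATIONSHIPS = [
    ("pod/cart-0", "contains", "container/cart-0/app"),
    ("pod/db-0", "contains", "container/db-0/postgres"),
    ("persistentvolume/pv-1", "is-mounted-by", "container/db-0/postgres"),
]

# Higher number means more specific (deployment > pod > container).
SPECIFICITY = {"deployment": 0, "pod": 1, "container": 2}

def kind(resource):
    """Return the kind prefix of a "kind/name" resource reference."""
    return resource.split("/", 1)[0]

def related(subject, relation):
    """Return all objects linked to the subject by the given relation."""
    return [o for s, r, o in RELATIONSHIPS if s == subject and r == relation]

def most_specific(resources):
    """Pick the most specific workload resource, or None if there is none."""
    workloads = [r for r in resources if kind(r) in SPECIFICITY]
    return max(workloads, key=lambda r: SPECIFICITY[kind(r)], default=None)

def correlate(resources):
    """Apply the workload and persistentvolume rules to the referenced resources."""
    interesting = []
    workload = most_specific(resources)
    if workload is not None:
        interesting += [("logs", workload), ("traces", workload)]
    for pv in (r for r in resources if kind(r) == "persistentvolume"):
        interesting.append(("k8s-events", pv))
        for container in related(pv, "is-mounted-by"):
            interesting += [("logs", container), ("traces", container)]
    return interesting
```

For example, correlate(["deployment/cart", "pod/cart-0"]) selects logs and traces for the pod only, since the pod is the more specific of the two references.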

There are two components: UI elements that allow the user to navigate between correlated data, and a "correlation engine" that can take a correlation context and return correlated data from multiple sources (log store, metric database, k8s resources etc.). The engine will provide an API that supports the UI, but can also be used from other clients.
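
A minimal sketch of the engine's shape, assuming each data source is wrapped in a small adapter with a common query method. The Context, Store and Engine names, their fields and the fake stores are hypothetical and only illustrate the "context in, correlated data out" contract; they are not the actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol

@dataclass(frozen=True)
class Context:
    """Hypothetical correlation context: a resource plus a time window."""
    kind: str
    name: str
    namespace: str
    start: datetime
    end: datetime

class Store(Protocol):
    """Assumed adapter interface for one source (log store, metric database, ...)."""
    signal: str
    def query(self, context: Context) -> list: ...

class Engine:
    """Fan a correlation context out to every store and collect the results."""
    def __init__(self, stores):
        self.stores = list(stores)

    def correlate(self, context):
        return {store.signal: store.query(context) for store in self.stores}

# Fake stores standing in for real back ends.
class FakeLogStore:
    signal = "logs"
    def query(self, context):
        return [f"log line from {context.namespace}/{context.name}"]

class FakeMetricStore:
    signal = "metrics"
    def query(self, context):
        return [f"cpu usage for {context.name}"]

engine = Engine([FakeLogStore(), FakeMetricStore()])
context = Context(
    kind="pod", name="cart-0", namespace="shop",
    start=datetime(2023, 1, 1, 11, 45, tzinfo=timezone.utc),
    end=datetime(2023, 1, 1, 12, 5, tzinfo=timezone.utc),
)
result = engine.correlate(context)
```

Because the engine returns plain data keyed by signal type, the same call can back the console UI or any other client.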

Goals

A console with correlation capabilities suited to openshift/k8s:

For an openshift cluster we provide a complete observation console out of the box.
Our tools correlate between all the observable signals, alerts and resources generated by openshift.
Our tools use detailed k8s/openshift domain rules to give more accurate, focused correlations than generic label matching techniques.

Our tools can complement self-managed 3rd party visualization tools:

Use self-managed 3rd party tools to visualize the cluster data directly.
- We provide access to open format data (loki, prometheus etc.)
Use our console tools to diagnose a problem, 3rd party tools to look deeper.
- Our tools give superior navigation and correlation within openshift
- We provide contextual links to jump to relevant views in a 3rd party tool at any point.

Also see the linked Epics.

Non-goals

We are not trying to compete with Grafana, Data Dog, Jaeger and other specialized data visualization tools.

they focus on visualizing arbitrary signal data, and may have superior visualization and analysis capabilities
we focus on visualizing k8s/openshift cluster data, and can use domain specific knowledge of signals to improve correlation results.

We are not trying to prevent or discourage use of 3rd party tools, but provide an alternative with unique capabilities in the openshift/k8s context.

incorporates

LOG-2132 Correlation service for observability data

Closed

LOG-2133 Console views for correlated signal data.

Closed

OBSDA-8 Allow log exploration natively inside the OpenShift Console to reduce the number of UIs users need to access and use to a bare minimum

Closed

relates to

OU-306 Correlation of observability signals

Closed

links to

Initial correlation presentation

Use cases for correlating alerts
