Type: Epic
Resolution: Obsolete
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Epic Name:
Correlation of observability signals
Epic Status:
To Do
Activity Type:
None
Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected
Size:
None

Target Version:
None
Release Blocker:
None

Use cases

As an SRE receiving an alert, I want to:

Quickly navigate to relevant runbooks, logs, metrics, traces, events and resource status to understand the problem.
Easily navigate between all these different views to gather information when:
- the runbook solution does not apply or does not work, so I have to dig deeper.
- I want to verify the real situation matches the runbook's assumptions before applying it.
Quickly jump to relevant resource specs to take necessary actions (for example, killing stuck pods so they can restart).

As an SRE with a container/pod/service that is known to be in trouble, I want to:

Do the same as in the previous use case, but starting from a resource as the initial "correlation context" rather than an alert.

These use cases are the initial focus; there are many other potential uses of signal and resource correlation.

Overview

Correlation means allowing the user to retrieve relevant signal and resource data across all types of signal given an initial correlation context such as an alert or a resource.

Signals

A signal is data periodically emitted or collected from cluster components, providing a history of events or data-points over time.

The traditional observable signals in a cluster include:

Logs (structured or unstructured text records emitted by containers)
Metrics (numeric values collected periodically)
Alerts (structured records indicating an important transition in metric values)
Traces (tree-structured records of function calls or network requests, traceable across multiple containers)

We also consider these to be signals:

K8s Events - effectively these are structured logs stored as API objects instead of log file records.
Network Flow Events (coming soon, records of network-level events)

The following objects are not signals (they don't provide a history of values) but can be correlated with signals:

Resources - references to pods, containers, persistentvolumes etc. are the links between correlated data
Runbooks - contain identifiable actions (e.g. oc delete pod suchandsuch) that provide:
- links to relevant resources
- actions on those resources that can be automated as a single click, instead of commands to be copied to a terminal.

Types of correlation

Most correlations will involve selecting signals in a restricted time window.
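
A minimal sketch of time-window selection, assuming each signal is a record with a timestamp. The `Signal` type, the `in_window` helper and the default margins (15 minutes before, 5 minutes after) are hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    kind: str            # hypothetical: "log", "metric", "event", ...
    timestamp: datetime  # when the record was emitted or collected
    body: str


def in_window(signals, center, before=timedelta(minutes=15), after=timedelta(minutes=5)):
    """Return the signals whose timestamp lies in [center - before, center + after]."""
    start, end = center - before, center + after
    return [s for s in signals if start <= s.timestamp <= end]
```

For an alert context, `center` would typically be the time the alert fired; for a resource context it might be "now" or a time picked by the user.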

Alerts refer to resources. Those resources can be linked to signals by using resource attributes to query signal metadata: signals carry metadata such as the originating pod or container, k8s labels etc.
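
As a rough illustration of using resource attributes to query signal metadata, the sketch below turns the resource-identifying labels of an alert into a `{key="value", ...}` label selector, the matcher syntax shared by PromQL and LogQL. The function name and the choice of `namespace`, `pod` and `container` as keys are assumptions; the label names actually used by a given log or metric store may differ.

```python
def label_selector(alert_labels, keys=("namespace", "pod", "container")):
    """Build a label selector from the alert labels that identify a resource.

    Labels not listed in `keys` (e.g. alertname, severity) are ignored.
    Values are not escaped, which is acceptable for k8s resource names
    because they cannot contain quotes or backslashes.
    """
    pairs = [f'{k}="{alert_labels[k]}"' for k in keys if k in alert_labels]
    return "{" + ", ".join(pairs) + "}"
```

For example, an alert labelled with `namespace` "shop" and `pod` "cart-1" would yield `{namespace="shop", pod="cart-1"}`, which could then be combined with a time window to select logs or metrics.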

An important part of this feature will be to create a database of interesting resources and relationships (e.g. pod contains container, persistentvolume is-mounted-by container) and to use those to construct correlation rules, for example:

If an alert (or its runbook) refers to a workload resource, then logs/traces from the most specific workload resource (deployment > pod > container) are interesting
If alert/runbook refers to a persistentvolume then k8s events referring to that PV and logs/traces from containers that mount that PV are interesting
If a context resource is referred to by a metric or alerting rule, then the most specific (best match for resource name < namespace < type etc.) metric/alert histories are interesting
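
The persistentvolume rule above could be sketched as follows, assuming a toy in-memory relationship store. `ResourceGraph`, `interesting_for_pv` and the `kind/name` resource identifiers are hypothetical; a real implementation would populate the relationships from the k8s API rather than by hand.

```python
from collections import defaultdict


class ResourceGraph:
    """Toy store of (subject, relation, object) facts about resources."""

    def __init__(self):
        self._facts = defaultdict(set)

    def add(self, subject, relation, obj):
        self._facts[(subject, relation)].add(obj)

    def related(self, subject, relation):
        # Sorted so that results are deterministic.
        return sorted(self._facts.get((subject, relation), set()))


def interesting_for_pv(graph, pv):
    """Rule: k8s events about the PV, plus logs/traces from containers that mount it."""
    queries = [("event", pv)]
    for container in graph.related(pv, "is-mounted-by"):
        queries.append(("log", container))
        queries.append(("trace", container))
    return queries
```

Each returned pair names a signal type and the resource to query it for; turning those pairs into actual store queries is left out of the sketch.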

There are two components: UI elements that allow the user to navigate between correlated data, and a "correlation engine" that can take a correlation context and return correlated data from multiple sources (log store, metric database, k8s resources etc.). The engine will provide an API that supports the UI, but can also be used from other clients.
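
One possible shape for the engine, shown only as a sketch: sources are registered per signal type, and a single call fans a correlation context out to all of them. `Context`, `CorrelationEngine`, `register` and `correlate` are hypothetical names, and records are reduced to plain strings; this epic does not define the API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Context:
    kind: str  # hypothetical: "alert" or "resource"
    name: str


# A source takes a context and returns the matching records.
Source = Callable[[Context], List[str]]


class CorrelationEngine:
    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, signal_type: str, source: Source) -> None:
        """Attach a data source (log store, metric database, ...) under a signal type."""
        self._sources[signal_type] = source

    def correlate(self, context: Context) -> Dict[str, List[str]]:
        """Query every registered source and group the results by signal type."""
        return {name: source(context) for name, source in self._sources.items()}
```

Because the UI and other clients would go through the same `correlate` call, adding a new signal type would only require registering another source.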

Goals

A console with correlation capabilities suited to openshift/k8s:

For an openshift cluster we provide a complete observation console out of the box.
Our tools correlate between all the observable signals, alerts and resources generated by openshift.
Our tools use detailed k8s/openshift domain rules to give more accurate, focused correlations than generic label matching techniques.

Our tools can complement self-managed 3rd party visualization tools:

Use self-managed 3rd party tools to visualize the cluster data directly.
- We provide access to open format data (loki, prometheus etc.)
Use our console tools to diagnose a problem, 3rd party tools to look deeper.
- Our tools give superior navigation and correlation within openshift
- We provide contextual links to jump to relevant views in a 3rd party tool at any point.

Also see the linked Epics.

Non-goals

We are not trying to compete with Grafana, Datadog, Jaeger and other specialized data visualization tools.

They focus on visualizing arbitrary signal data, and may have superior visualization and analysis capabilities.
We focus on visualizing k8s/openshift cluster data, and can use domain-specific knowledge of signals to improve correlation results.

We are not trying to prevent or discourage use of 3rd party tools, but provide an alternative with unique capabilities in the openshift/k8s context.

is related to

OBSDA-110 Correlation of observability signals

Closed
