Observability UI / OU-306

Correlation of observability signals


    • Type: Epic
    • Resolution: Obsolete
    • Priority: Normal

      Use cases

      As an SRE receiving an alert, I want to:

      • Quickly navigate to relevant runbooks, logs, metrics, traces, events and resource status to understand the problem.
      • Easily navigate between all these different views to gather information when:
        • the runbook solution does not apply or does not work, so I have to dig deeper.
        • I want to verify that the real situation matches the runbook's assumptions before applying it.
      • Quickly jump to relevant resource specs to take necessary actions (for example, killing stuck pods so they can restart).

      As an SRE with a container/pod/service that is known to be in trouble, I want to:

      • Do the same as in the previous use case, but starting from a resource as the initial "correlation context" rather than an alert.

      These use cases are the initial focus; there are many other potential uses of signal and resource correlation.

      Overview

      Correlation means allowing the user to retrieve relevant signal and resource data across all types of signal given an initial correlation context such as an alert or a resource.

      Signals

      A signal is data periodically emitted or collected from cluster components, providing a history of events or data points over time.

      The traditional observable signals in a cluster include:

      • Logs (structured or unstructured text records emitted by containers)
      • Metrics (numeric values collected periodically)
      • Alerts (structured records indicating an important transition in metric values) 
      • Traces (tree-structured records of function calls or network requests, traceable across multiple containers)

      We also consider these to be signals:

      • K8s Events - effectively these are structured logs stored as API objects instead of log file records.
      • Network Flow Events (coming soon, records of network-level events)

      The following objects are not signals (don't provide a history of values) but can be correlated with signals:

      • Resources - references to pods, containers, persistentvolumes etc. are the links between correlated data
      • Runbooks - contain identifiable actions (e.g. oc delete pod suchandsuch) that provide:
        • links to relevant resources
        • actions on those resources that can be automated as a single click, instead of commands to be copied to a terminal.

      Types of correlation

      Most correlations will involve selecting signals in a restricted time window.

      Alerts refer to resources. Signals carry metadata such as the origin pod or container, k8s labels etc., so a resource can be linked to signals by using its attributes to query for matching signal metadata.
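The linking step described above can be sketched as follows. This is a minimal illustration, not a proposed implementation: the choice of `namespace`, `pod` and `container` as the resource-identifying labels, the helper names, the window sizes and the sample alert are all assumptions. The selector is written in the `{name="value"}` form accepted by Prometheus and Loki queries, and label values are not escaped here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: these alert labels identify the resource the alert refers to.
RESOURCE_LABELS = ("namespace", "pod", "container")


def selector_from_alert(alert_labels):
    """Build a label selector from the resource-identifying labels of an alert."""
    pairs = [
        f'{name}="{alert_labels[name]}"'
        for name in RESOURCE_LABELS
        if name in alert_labels
    ]
    return "{" + ", ".join(pairs) + "}"


def time_window(fired_at, before=timedelta(minutes=30), after=timedelta(minutes=5)):
    """Return a (start, end) window around the time the alert fired."""
    return fired_at - before, fired_at + after


# Made-up sample alert labels, for illustration only.
alert = {
    "alertname": "KubePodCrashLooping",
    "namespace": "shop",
    "pod": "cart-7d9f",
    "severity": "warning",
}
selector = selector_from_alert(alert)
start, end = time_window(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(selector, start.isoformat(), end.isoformat())
```

Labels that do not identify a resource (here `alertname` and `severity`) are dropped, so the same selector can be reused against the log store and the metric database within the chosen window.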

      An important part of this feature will be to create a database of interesting resources and relationships (e.g. pod contains container, persistentvolume is-mounted-by container) and to use those to construct correlation rules, for example:

      • If an alert (or its runbook) refers to a workload resource, then logs/traces from the most specific workload resource (deployment > pod > container) are interesting
      • If alert/runbook refers to a persistentvolume then k8s events referring to that PV and logs/traces from containers that mount that PV are interesting
      • If a context resource is referred to by a metric or alerting rule, then the most specific (best match for resource name < namespace < type etc.) metric/alert histories are interesting
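As a rough sketch of how a relationship database could drive such rules, the example below stores relationships as (subject, relation, object) triples in memory and applies the persistentvolume rule. The `Resource` tuple, the `RelationshipGraph` class, the function name and the sample resource names are all hypothetical; only the relation name `is-mounted-by` and the rule itself come from the text above.

```python
from collections import defaultdict
from typing import NamedTuple


class Resource(NamedTuple):
    """Hypothetical resource reference; namespace is empty for cluster-scoped kinds."""
    kind: str
    name: str
    namespace: str = ""


class RelationshipGraph:
    """In-memory store of (subject, relation, object) triples."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def related(self, subject, relation):
        return sorted(self._edges.get((subject, relation), ()))


def interesting_signals_for_pv(graph, pv):
    """PV rule: events referring to the PV, plus logs/traces of containers that mount it."""
    interesting = [("events", pv)]
    for container in graph.related(pv, "is-mounted-by"):
        interesting.append(("logs", container))
        interesting.append(("traces", container))
    return interesting


graph = RelationshipGraph()
pv = Resource("persistentvolume", "pv-data")
db = Resource("container", "db-0/postgres", "shop")
graph.add(pv, "is-mounted-by", db)
print(interesting_signals_for_pv(graph, pv))
```

Each rule returns (signal type, resource) pairs rather than data, so a separate step can turn them into queries against the relevant store.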

      There are two components: UI elements that allow the user to navigate between correlated data, and a "correlation engine" that takes a correlation context and returns correlated data from multiple sources (log store, metric database, k8s resources etc.). The engine will provide an API that supports the UI but can also be used by other clients.
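One possible shape for the engine side is sketched below, assuming each data source is wrapped as a function that takes a correlation context and a time window and returns matching records. The `CorrelationEngine` class, its method names, the context fields and the in-memory sample log store are illustrative assumptions, not an agreed API.

```python
from typing import Callable, Dict, List, Tuple

# A source takes a correlation context and a (start, end) window and returns records.
Source = Callable[[dict, Tuple[int, int]], List[dict]]


class CorrelationEngine:
    """Fans a correlation context out to every registered signal source."""

    def __init__(self):
        self._sources: Dict[str, Source] = {}

    def register(self, signal_type, source):
        self._sources[signal_type] = source

    def correlate(self, context, window):
        return {name: source(context, window) for name, source in self._sources.items()}


# Made-up in-memory stand-in for a log store; timestamps are plain integers.
LOGS = [
    {"pod": "cart-7d9f", "ts": 100, "msg": "connection refused"},
    {"pod": "cart-7d9f", "ts": 900, "msg": "started"},
    {"pod": "web-1", "ts": 110, "msg": "ok"},
]


def log_source(context, window):
    start, end = window
    return [r for r in LOGS if r["pod"] == context["pod"] and start <= r["ts"] <= end]


engine = CorrelationEngine()
engine.register("logs", log_source)
result = engine.correlate({"kind": "pod", "pod": "cart-7d9f"}, (50, 200))
print(result)
```

Keeping the sources behind a single `correlate` call means the UI and any other client see the same result structure, whichever stores are plugged in.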

      Goals

      A console with correlation capabilities suited to openshift/k8s:

      1. For an openshift cluster we provide a complete observation console out of the box.
      2. Our tools correlate between all the observable signals, alerts and resources generated by openshift.
      3. Our tools use detailed k8s/openshift domain rules to give more accurate, focused correlations than generic label matching techniques.

      Our tools can complement self-managed 3rd party visualization tools:

      • Use self-managed 3rd party tools to visualize the cluster data directly.
        • We provide access to open format data (Loki, Prometheus etc.)
      • Use our console tools to diagnose a problem, 3rd party tools to look deeper.
        • Our tools give superior navigation and correlation within openshift
        • We provide contextual links to jump to relevant views in a 3rd party tool at any point.

      Also see the linked Epics.

      Non-goals

      We are not trying to compete with Grafana, Datadog, Jaeger and other specialized data visualization tools.

      • they focus on visualizing arbitrary signal data, and may have superior visualization and analysis capabilities
      • we focus on visualizing k8s/openshift cluster data, and can use domain-specific knowledge of signals to improve correlation results.

      We are not trying to prevent or discourage use of 3rd party tools, but to provide an alternative with unique capabilities in the openshift/k8s context.

       

       

              rhn-engineering-aconway Alan Conway
              rhn-engineering-aconway Alan Conway
              Anping Li Anping Li