Type: Feature
Resolution: Unresolved
Priority: Normal
Fix Version/s: Logging 5.5.0
Affects Version/s: None
Component/s: Data Visualization, PM Obs-UI
Labels:
- pm_ack+
- ui-consideration

Blocked: False
Ready: False
Hierarchy Progress Bar: 0% To Do, 0% In Progress, 100% Done
Release Note Text: Undefined

Problem Alignment

The Problem

Today, when SREs are paged for a critical alert, they often have to look up information in several systems, each with its own user experience. For example:

Details of the alert inside the OpenShift Console
Log information inside Kibana

Moving between these tools means copying and pasting metadata by hand to narrow the results down to the relevant data. This is tedious for SREs who need to find the root cause quickly, because it adds several unnecessary clicks and steps. Even with the metadata in hand, they still have to remember the right query to retrieve the relevant logs in Kibana, which can be challenging when paged around 3am. All of this could be avoided.

High-Level Approach

Expose log information from the underlying storage via an API that the Console can query to retrieve contextualized logs. For this specific use case, the idea is to use the labels attached to every alert (metadata such as namespace or time frame) to identify the relevant logs and display them in the same view, so that users do not have to copy and paste that metadata into a Kibana query.
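As a rough illustration of this approach, the sketch below turns an alert's labels and firing time into a log query. Everything in it is an assumption made for illustration: the `LogQuery` shape, the `query_from_alert` helper, the choice of relevant labels, and the 15-minute window are hypothetical and are not defined by this issue or by any existing API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LogQuery:
    """Hypothetical request shape; the real API is not defined in this issue."""
    label_filters: dict
    start: datetime
    end: datetime


# Assumption: only these alert labels are useful for narrowing down logs.
RELEVANT_LABELS = ("namespace", "pod", "container")


def query_from_alert(labels, fired_at, window=timedelta(minutes=15)):
    """Build a log query from an alert's labels and the time it fired."""
    # Keep only the labels that identify where the logs come from.
    filters = {key: labels[key] for key in RELEVANT_LABELS if key in labels}
    # Look at a symmetric time window around the moment the alert fired.
    return LogQuery(label_filters=filters, start=fired_at - window, end=fired_at + window)


# Example alert (made-up values).
alert_labels = {"alertname": "HighErrorRate", "namespace": "shop", "severity": "critical"}
fired_at = datetime(2021, 7, 13, 3, 0, tzinfo=timezone.utc)

query = query_from_alert(alert_labels, fired_at)
print(query.label_filters)  # {'namespace': 'shop'}
```

The point of the sketch is that the user never types a query: the Console would derive it from metadata the alert already carries.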

Goal & Success

Reduce the number of steps to get the relevant logs to a minimum.

Solution Alignment

Key Capabilities

Provide a well-defined API to retrieve log information from the underlying log management store so that it can be used by various tools for further processing (e.g. the Console to display certain log information).
Display the number of log entries over time in the graph shown on the alert details page so that problem areas can be identified quickly.
Allow users to view contextualized log information related to an individual alert.
Allow users to further filter the currently displayed logs so that relevant information is easier to spot and noise is kept out.

Key Flows

An AppSRE gets paged around 3am and opens Slack, where they have received a critical alert notification. They click the provided link and are taken to the alert details page inside the OpenShift Console. The chart shows a significant number of logs produced in a particular time frame (displayed as a bar), very close to when the defined alert threshold was breached. By clicking the bar, the AppSRE sees only the logs from that time frame. For better visibility, they further filter the logs to show only "error" logs.
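The flow above can be sketched in a few lines: count log entries per time bucket (the bars in the chart), pick one bucket (the click), and filter by level. This is only a minimal sketch under stated assumptions: the entry fields (`timestamp`, `level`, `message`), the one-minute bucket width, and the helper names are hypothetical, and the sample log entries are made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, width):
    """Round a timestamp down to the start of its bucket."""
    return ts - ((ts - EPOCH) % width)


def count_per_bucket(logs, width=timedelta(minutes=1)):
    """Number of log entries per bucket; one bucket corresponds to one bar."""
    return Counter(bucket_start(entry["timestamp"], width) for entry in logs)


def filter_logs(logs, bucket, width=timedelta(minutes=1), level=None):
    """Entries inside the selected bucket, optionally restricted to one level."""
    selected = [e for e in logs if bucket <= e["timestamp"] < bucket + width]
    if level is not None:
        selected = [e for e in selected if e["level"] == level]
    return selected


# Made-up sample data around 3am.
t0 = datetime(2021, 7, 13, 3, 0, tzinfo=timezone.utc)
logs = [
    {"timestamp": t0 + timedelta(seconds=5), "level": "info", "message": "request served"},
    {"timestamp": t0 + timedelta(seconds=70), "level": "error", "message": "connection refused"},
    {"timestamp": t0 + timedelta(seconds=80), "level": "info", "message": "retrying"},
    {"timestamp": t0 + timedelta(seconds=95), "level": "error", "message": "connection refused"},
]

counts = count_per_bucket(logs)
busiest, _ = counts.most_common(1)[0]               # the tallest bar
errors = filter_logs(logs, busiest, level="error")  # click the bar, then filter
print(len(errors))  # 2
```

In a real implementation the counting and filtering would more likely happen in the log store behind the API rather than in the client; the in-memory version here only shows the intended behavior.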

Open Questions & Key Decisions (optional)

is blocked by

PD-972 Observability - Show alert-related logs - High Level Design

To Do

Assignee: Vanessa Martini
Reporter: Christian Heidenreich (Inactive)
Votes: 0
Watchers: 4
Created: 2021/07/13 8:18 AM
Updated: 2024/05/03 3:29 PM
