Today, when SREs are paged for a critical alert, they often need to look up information in several systems, each with its own user experience. For example:
- Details of the alert inside the OpenShift Console
- Log information inside Kibana
Moving between these tools means copying and pasting metadata by hand to narrow the results down to the relevant data. This is tedious for SREs who need to find the root cause quickly, because it adds multiple unnecessary clicks and steps. Even with the metadata at hand, they still have to remember the appropriate query to get the relevant logs in Kibana, which can be challenging right after being paged at around 3am. All of that could be avoided.
Expose log information from the underlying storage via an API that the Console can query to retrieve contextualized logs. For this specific use case, the idea is to use the labels attached to every alert (metadata such as the namespace or the time frame) to identify the important logs and display them in the same view, so that users no longer have to copy and paste that metadata into a Kibana query.
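
The idea of deriving a log query from alert metadata can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed API: the function `build_log_query`, the `CONTEXT_LABELS` tuple, the 15-minute window, and the shape of the returned dictionary are all hypothetical. The alert is assumed to carry a `labels` map and an ISO 8601 `startsAt` timestamp, as Alertmanager-style alerts commonly do.

```python
from datetime import datetime, timedelta

# Hypothetical: which alert labels are worth carrying over into a log query.
CONTEXT_LABELS = ("namespace", "pod", "container")


def build_log_query(alert: dict, window: timedelta = timedelta(minutes=15)) -> dict:
    """Turn alert metadata into a storage-agnostic description of a log query."""
    labels = alert.get("labels", {})
    # Keep only the labels that identify where the logs come from.
    matchers = {key: labels[key] for key in CONTEXT_LABELS if key in labels}
    # Centre the time frame on the moment the alert started firing.
    fired_at = datetime.fromisoformat(alert["startsAt"])
    return {
        "matchers": matchers,
        "start": (fired_at - window).isoformat(),
        "end": (fired_at + window).isoformat(),
    }


# Made-up alert used purely for illustration.
alert = {
    "labels": {
        "alertname": "HighErrorRate",
        "namespace": "app-prod",
        "severity": "critical",
    },
    "startsAt": "2021-03-01T03:02:00+00:00",
}
query = build_log_query(alert)
print(query)
```

Whatever the final API looks like, the point is that the Console would assemble this query from the alert it is already displaying, rather than the user retyping the same values elsewhere.
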
- Reduce the number of steps to get the relevant logs to a minimum.
- Provide a well-defined API to retrieve log information from the underlying log management store so that it can be used by various tools for further processing (e.g. the Console displaying certain log information).
- Display the number of logs at a particular time inside the graph shown on the alert details page to quickly identify problem areas.
- Allow users to view contextualized log information related to an individual alert.
- Allow users to further filter the currently displayed logs so that the relevant information is easier to spot and the noise is kept out.
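
To illustrate the chart goal in the list above, the following sketch counts log entries per fixed-size time bucket, which is the kind of data a bar overlay on the alert graph would need. The helper `count_logs_per_bucket` and the 60-second bucket size are assumptions made for this example. In practice the counting would more likely happen in the log store than in the client.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_logs_per_bucket(timestamps, bucket_seconds=60):
    """Return {bucket_start_epoch: count} for every non-empty bucket, in time order."""
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        # Round down to the start of the bucket this entry falls into.
        counts[epoch - epoch % bucket_seconds] += 1
    return dict(sorted(counts.items()))


# Made-up log timestamps, expressed as offsets in seconds from 03:00:00 UTC.
base = datetime(2021, 3, 1, 3, 0, 0, tzinfo=timezone.utc)
timestamps = [base + timedelta(seconds=s) for s in (5, 20, 59, 61, 130)]

buckets = count_logs_per_bucket(timestamps)
for start, count in buckets.items():
    print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat(), count)
```
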
An AppSRE gets paged at around 3am and opens Slack, where they have received a critical alert notification. They click on the provided link and are taken to the alert details page inside the OpenShift Console. The displayed chart shows a significant number of logs produced in a particular time frame (displayed as a bar), very close to when the defined alert threshold was breached. By clicking on the bar, the AppSRE now sees only the logs from that time frame. For better visibility, they choose to further filter the logs to show only "error" logs.
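
The last two steps of this story, selecting a bar and then narrowing to "error" logs, amount to a time-range filter followed by a level filter. A minimal sketch, assuming a hypothetical `LogEntry` record with `timestamp`, `level` and `message` fields and a hypothetical helper `logs_in_bucket`:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: datetime
    level: str
    message: str


def logs_in_bucket(
    entries: List[LogEntry],
    bucket_start: datetime,
    bucket_size: timedelta,
    level: Optional[str] = None,
) -> List[LogEntry]:
    """Keep entries inside [bucket_start, bucket_start + bucket_size), optionally by level."""
    bucket_end = bucket_start + bucket_size
    return [
        entry
        for entry in entries
        if bucket_start <= entry.timestamp < bucket_end
        and (level is None or entry.level == level)
    ]


# Made-up entries around the bar the user clicked (03:02 to 03:03 UTC).
bar = datetime(2021, 3, 1, 3, 2, 0, tzinfo=timezone.utc)
entries = [
    LogEntry(bar - timedelta(seconds=10), "info", "request served"),
    LogEntry(bar + timedelta(seconds=5), "error", "connection refused"),
    LogEntry(bar + timedelta(seconds=30), "info", "retrying"),
    LogEntry(bar + timedelta(seconds=45), "error", "connection refused"),
    LogEntry(bar + timedelta(seconds=60), "error", "giving up"),
]

clicked = logs_in_bucket(entries, bar, timedelta(minutes=1))
errors_only = logs_in_bucket(entries, bar, timedelta(minutes=1), level="error")
print(len(clicked), len(errors_only))
```

The bucket is treated as half-open, so an entry that lands exactly on the next bar's start time belongs to that next bar and is not shown twice.
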