Type: Epic
Resolution: Done
Priority: Critical
Fix Version/s: CY25Q3
Affects Version/s: None
Labels:
- cwf-product-configuration

Epic Name:
Konflux UI SLO Dashboard and SOPs
Story Points:
15
Blocked:
False
Blocked Reason:

None
Ready:
False
Activity Category:
Konflux
Commitment:
Targeted
Epic Status:
In Progress
Feature Link:
KONFLUX-8640 - Konflux SLO incident response - dashboards and SOPs
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

This is stage 2 of a 2-stage strategy to meet the requirements of KONFLUX-4930. Stage 1 was covered in SPRE-1357.

In this second stage, we will focus on building the SLO dashboard and SOPs needed to diagnose and recover from known failure modes in the Console. We will expand the scope beyond the metrics included in the konflux_up signal to present SREs and service teams with any additional information that would be useful in diagnosing the service.

Requirements

Identify platform dependencies for the service, such as CPU / memory consumption, disk usage, and important external services or network links
Identify service dependencies of this service, such as external cloud providers, which might have metrics we can use here
Identify metrics related to recent execution history, such as per-architecture success rate
Identify important metrics related to latency and throughput, so SREs can see load on the system
Identify mechanisms for measuring or otherwise exposing these metrics for monitoring
Build a dashboard definition that includes all the signals we need to monitor for the service
Analyze failure modes on the monitored signals to identify good candidates for SOPs
Write SOPs that account for well-known failure modes
Include more general diagnostic guidance for less well-known problems in the SOPs (possibly in a general-purpose Argo diagnostic SOP)
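As an illustration of the "per-architecture success rate" style of metric mentioned above, here is a minimal Python sketch that groups recent runs by a label and computes the fraction that succeeded. The function name, the record shape, and the sample data are all hypothetical; the real signal would come from whatever metrics the service ends up exposing, not from this code.

```python
from collections import defaultdict


def success_rate_by_label(runs):
    """Return {label: fraction of runs that succeeded}.

    `runs` is an iterable of (label, succeeded) pairs. The label could be
    an architecture, a route, or any other dimension (assumption: the
    grouping dimension is chosen by whoever defines the metric).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for label, succeeded in runs:
        totals[label] += 1
        if succeeded:
            successes[label] += 1
    # Every label in `totals` has at least one run, so no division by zero.
    return {label: successes[label] / totals[label] for label in totals}


# Hypothetical sample data, for illustration only.
sample_runs = [
    ("amd64", True),
    ("amd64", True),
    ("amd64", False),
    ("arm64", True),
]
rates = success_rate_by_label(sample_runs)
```

In practice this kind of ratio would more likely be computed in the monitoring backend from counters; the sketch only shows the arithmetic the dashboard panel would represent.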

Acceptance Criteria

SLO Dashboard is defined and available in RHOBS (depending on what SPRE has access to refine freely)
- Dashboard includes panels for as many relevant signals as we can reasonably extract, provided they give useful information about service health
- Dashboard may be split into multiple dashboards to maximize responsiveness, but in a multi-dashboard scenario they should be linked together, with the selected time range preserved, so that users can switch back and forth rapidly
SOPs are written for all well-known failure modes, and include links to the SLO dashboard(s) and links to Splunk log queries as appropriate
- General diagnostic guidance is included, either as a subsection on troubleshooting that pertains to each specific SOP or as a standalone SOP for the service.
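The multi-dashboard criterion above (dashboards linked together with timespans preserved) could be met with dashboard-level links in the Grafana dashboard JSON. The sketch below builds such a link in Python. It assumes the `keepTime` and `includeVars` link fields of the Grafana dashboard JSON model behave as documented in the Grafana version deployed; the dashboard titles and the URL path are placeholders, not real Konflux dashboards.

```python
import json


def linked_dashboard_entry(title, url):
    """Build one dashboard-level link that carries the time range across.

    Field names follow the Grafana dashboard JSON model as commonly
    documented (assumption: verify against the deployed Grafana version).
    """
    return {
        "type": "link",
        "title": title,
        "url": url,
        "keepTime": True,      # preserve the currently selected time range
        "includeVars": True,   # carry template variable values across
        "targetBlank": False,  # open in the same tab for fast switching
    }


# Placeholder titles and URL, for illustration only.
dashboard = {
    "title": "Konflux UI SLO (overview)",
    "panels": [],
    "links": [
        linked_dashboard_entry(
            "Konflux UI SLO (details)",
            "/d/<details-uid>/konflux-ui-slo-details",
        )
    ],
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Each dashboard in the set would carry a link back to the others, so that switching between them keeps the same time window in view.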

links to

How to create dashboards in Grafana

redhat-appstudio/infra-deployments#6630: feat(KFLUXUI-504): Add banner content yaml and its kustomization files

redhat-appstudio/infra-deployments#7283: feat(KFLUXUI-635): expose konflux-ui metrics RHOBS

Assignee:: Joao Pedro Poloni Ponce

Reporter:: RHTAP Jira Bot

Contributors:: Stanley Jochman

Votes:: 0

Watchers:: 7

Due:: 2025/09/30

Created:: 2025/05/07 9:40 AM

Updated:: 2025/09/10 6:47 PM

Resolved:: 2025/09/08 1:20 PM

Target start:: 2025/07/01

Target end:: 2025/09/30
