Konflux UI / KFLUXUI-456

Konflux UI SLO Dashboard and SOPs


    • Konflux
    • Targeted
    • In Progress
    • KONFLUX-8640 - Konflux SLO incident response - dashboards and SOPs
    • 0% To Do, 0% In Progress, 100% Done

      This is stage 2 of a 2-stage strategy to meet the requirements of KONFLUX-4930. Stage 1 was covered in SPRE-1357.

      In this second stage, we will focus on building the SLO dashboard and SOPs needed to diagnose and recover from known failure modes in the Console. We will expand the scope beyond the metrics included in the konflux_up signal to present SREs and service teams with any additional information that would be useful in diagnosing the service.

      Requirements

      • Identify platform dependencies for the service, such as CPU / memory consumption, disk usage, and important external services or network links
      • Identify service dependencies of this service, such as external cloud providers, which might have metrics we can use here
      • Identify metrics related to recent execution history, such as per-architecture success rate
      • Identify important metrics related to latency and throughput, so SREs can see load on the system
      • Identify mechanisms for measuring or otherwise exposing these metrics for monitoring
      • Build a dashboard definition that includes all the signals we need to monitor for the service
      • Analyze failure modes on the monitored signals to identify good candidates for SOPs
      • Write SOPs that account for well-known failure modes
      • Include more general diagnostic guidance for less well-known problems in the SOPs (possibly in a general-purpose Argo diagnostic SOP)
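      As a rough illustration of what a dashboard definition covering these signals could look like, the sketch below builds a minimal dashboard in Grafana's JSON model. It assumes the target dashboards are Grafana-compatible; the namespace `konflux-ui`, the dashboard title and uid, and the request-latency histogram name are hypothetical placeholders, and `konflux_up` is shown without label selectors because the ticket does not specify them.

```python
import json

# Hypothetical namespace; replace with the Console's real namespace.
NAMESPACE = "konflux-ui"


def panel(panel_id, title, expr, x, y):
    """Return one Grafana timeseries panel backed by a single PromQL query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},
        "targets": [{"expr": expr, "refId": "A"}],
    }


def build_dashboard():
    """Assemble a two-column dashboard from (title, PromQL) pairs."""
    selector = f'{{namespace="{NAMESPACE}"}}'
    queries = [
        # Availability signal named in the ticket; label selectors are unknown.
        ("Availability (konflux_up)", "konflux_up"),
        # Platform dependencies: standard cAdvisor container metrics.
        ("CPU usage (cores)",
         f"sum(rate(container_cpu_usage_seconds_total{selector}[5m]))"),
        ("Memory working set (bytes)",
         f"sum(container_memory_working_set_bytes{selector})"),
        # Latency: the histogram name below is an assumption, not a known metric.
        ("p95 request latency (s)",
         "histogram_quantile(0.95, sum by (le) "
         f"(rate(http_request_duration_seconds_bucket{selector}[5m])))"),
    ]
    panels = [
        panel(i + 1, title, expr, x=(i % 2) * 12, y=(i // 2) * 8)
        for i, (title, expr) in enumerate(queries)
    ]
    return {
        "title": "Konflux UI SLO",   # hypothetical title
        "uid": "konflux-ui-slo",     # hypothetical uid
        "tags": ["konflux-ui"],
        "time": {"from": "now-6h", "to": "now"},
        "panels": panels,
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard(), indent=2))
```

      Keeping the queries in a flat list makes it easy to add or drop signals as the metric inventory from the earlier requirements firms up.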

      Acceptance Criteria

      • SLO Dashboard is defined and available in RHOBS (depending on what SPRE has access to refine freely)
        • Dashboard includes panels for as many relevant signals as we can reasonably extract, and which will give useful information about service health
        • Dashboard may be split into multiple dashboards to maximize responsiveness; in that case they should be linked together with the selected time range preserved, so users can switch between them quickly
      • SOPs are written for all well-known failure modes, and include links to the SLO dashboard(s) and links to Splunk log queries as appropriate
        • General diagnostic guidance is included, either as a troubleshooting subsection within each specific SOP or as a standalone SOP for the service
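      If the dashboards end up being Grafana-compatible (an assumption, not something the ticket confirms), the multi-dashboard linking criterion could be met with dashboard links that carry the current time range across. The sketch below attaches such a link to a dashboard dict; the link title, tag, and dashboard contents are hypothetical.

```python
import copy


def dashboard_link(title, tag):
    """Return a Grafana dashboard link to every dashboard carrying `tag`.

    keepTime carries the current time range to the target dashboard;
    includeVars carries the current template variable values.
    """
    return {
        "type": "dashboards",
        "title": title,
        "tags": [tag],
        "asDropdown": False,
        "keepTime": True,
        "includeVars": True,
    }


def with_links(dashboard, links):
    """Return a copy of `dashboard` with `links` set, leaving the input untouched."""
    linked = copy.deepcopy(dashboard)
    linked["links"] = list(links)
    return linked


# Hypothetical dashboards that share the "konflux-ui" tag.
overview = {"title": "Konflux UI SLO", "tags": ["konflux-ui"], "panels": []}
linked_overview = with_links(
    overview, [dashboard_link("Konflux UI dashboards", "konflux-ui")]
)
```

      Linking by tag rather than by URL means a newly added dashboard shows up in the link list as soon as it carries the shared tag.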

              rh-ee-jpolonip Joao Pedro Poloni Ponce
              rhtap-jira-bot RHTAP Jira Bot
              Stanley Jochman