-
Story
-
Resolution: Done
-
KONFLUX-123 - Konflux Availability SLO phase 1
-
-
Story (Required)
As a cluster admin or service-first engineering team for OpenShift Pipelines on Konflux, I want to clearly document for the App SRE tiger team how to interpret and triage each of our alerts when it fires.
Background (Required)
See https://docs.google.com/presentation/d/17aRQbg-EjjL8yyzT2YTNe80Ubtb0Dt93t9tmDnW_qPM/edit#slide=id.p for the initial presentation to the App SRE tiger team and Konflux infra on our SLOs/SLIs/SLAs.
We will update the following doc locations as our metrics and debugging support progress:
- https://gitlab.cee.redhat.com/service/app-interface/-/tree/master/data/services/stonesoup/pipeline-service?ref_type=heads
- https://gitlab.cee.redhat.com/konflux/docs/sop/-/tree/main/pipeline-service/slos?ref_type=heads
- https://gitlab.cee.redhat.com/konflux/docs/sop/-/tree/main/pipeline-service?ref_type=heads
- https://gitlab.cee.redhat.com/plnsvc/bugs
Out of scope
<Defines what is not included in this story>
Approach (Required)
Some key items to make sure are documented, aside from the new alerts:
- tuning options for controller threads, k8s QPS, and burst, not just for the core controller but also for chains and results, that are not available from our gitops repos
- upcoming webhook tuning
- the soon-to-be-ready enablement of pprof, minimally on results and perhaps across the board, since golang says it is safe to run in production, so that App SRE can get goroutine dumps when diagnosing deadlocks
Dependencies
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
Acceptance Criteria (Mandatory)
<Describe edge cases to consider when implementing the story and defining tests>
<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>
Done Checklist
- Code is completed, reviewed, documented and checked in
- Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
- Continuous Delivery pipeline(s) is able to proceed with new code included
- Customer facing documentation, API docs etc. are produced/updated, reviewed and published
- Acceptance criteria are met
- clones
-
SRVKP-4524 Update pipeline service SOPs in gitlab/app-interface, get tiger team/infra sign off, for core controller deadlocked metrics, rapid restarts
- Closed
- is blocked by
-
SRVKP-4530 build or expose metrics to determine if pac watcher/controller is deadlocked or performance severely degraded
- Closed
- is cloned by
-
SRVKP-5899 Update pipeline service SOPs in gitlab/app-interface, get tiger team sign off, for deadlocked metrics, anything else added, chains
- Closed