-
Story
-
Resolution: Done
-
Normal
-
None
-
8
-
False
-
None
-
False
-
KONFLUX-123 - Konflux Availability SLO phase 1
-
Release Note Not Required
-
Pipelines Sprint Pioneers 7, Pipelines Sprint Pioneers 8, Pipelines Sprint Pioneers 9, Pipelines Sprint Pioneers 10
Story (Required)
As a <PERSONA> trying to <ACTION> I want <THIS OUTCOME>
As a cluster administrator or SRE trying to maintain a cluster with OpenShift Pipelines, I want to be able to easily visualize where stability metrics are past acceptable thresholds.
<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>
Background (Required)
Per the conventions for Konflux, we currently:
- define panels that run in the cluster being monitored and display the Prometheus queries that serve as the basis for alerts; those are currently hosted in https://github.com/openshift-pipelines/pipeline-service ; these are what we would productize into OpenShift Pipelines
- also define panels in https://github.com/redhat-appstudio/o11y per the process described at https://github.com/redhat-appstudio/o11y/?tab=readme-ov-file#grafana-dashboards ; these dashboards get mapped to the Grafana system that App SRE uses to monitor all the clusters under its purview
- define the alerts, also in https://github.com/redhat-appstudio/o11y , per the process described under https://github.com/redhat-appstudio/o11y/?tab=readme-ov-file#alerting-rules ; the alerts are based on the metrics delivered in https://issues.redhat.com/browse/SRVKP-4522 , which we have monitored in prod sufficiently to determine acceptable alert thresholds per that process; see https://github.com/redhat-appstudio/o11y/blob/main/rhobs/alerting/data_plane/prometheus.pipeline_alerts.yaml for the existing alerts
- presumably these alerts will be a starting point for what we would deliver in OpenShift Pipelines, though what we deliver would most likely be optional and configurable with respect to the precise thresholds
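To illustrate the "optional and configurable thresholds" idea from the last bullet, here is a minimal Python sketch. It is illustration only: the metric names, default limits, and helper names are hypothetical and are not taken from the o11y repository, pipeline-service, or the existing alert rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A configurable limit for one stability metric (hypothetical model)."""
    metric: str
    limit: float
    # True when values above the limit are unhealthy; False when values below are.
    higher_is_worse: bool = True


def with_overrides(thresholds, overrides):
    """Return a copy of the thresholds with per-metric limits replaced by overrides."""
    return [
        Threshold(t.metric, overrides.get(t.metric, t.limit), t.higher_is_worse)
        for t in thresholds
    ]


def breached(thresholds, samples):
    """Return the names of metrics whose latest sample is past its threshold.

    Metrics with no sample are skipped rather than reported as breaches.
    """
    result = []
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            continue
        past = value > t.limit if t.higher_is_worse else value < t.limit
        if past:
            result.append(t.metric)
    return result


# Hypothetical metric names and default limits, for illustration only.
DEFAULT_THRESHOLDS = [
    Threshold("controller_restarts_per_hour", 3),
    Threshold("pipelineruns_without_pods", 0),
]
```

An administrator could then relax a default with `with_overrides(DEFAULT_THRESHOLDS, {"controller_restarts_per_hour": 10})` and pass the result to `breached` along with the latest metric samples; a dashboard panel would highlight whatever names come back.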
<Describes the context or background related to this story>
Out of scope
<Defines what is not included in this story>
Approach (Required)
<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>
Dependencies
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
Acceptance Criteria (Mandatory)
<Describe edge cases to consider when implementing the story and defining tests>
<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>
Done Checklist
- Code is completed, reviewed, documented and checked in
- Unit and integration test automation has been delivered and is running cleanly in the continuous integration/staging/canary environment
- Continuous Delivery pipeline(s) are able to proceed with new code included
- Customer facing documentation, API docs etc. are produced/updated, reviewed and published
- Acceptance criteria are met
- blocks
-
SRVKP-4524 Update pipeline service SOPs in gitlab/app-interface, get tiger team/infra sign off, for core controller deadlocked metrics, rapid restarts
- Closed
- is blocked by
-
SRVKP-4522 build metric to determine if core tekton controller is not creating pods for pipelines, determine if it is deadlocked
- Closed