- Epic
- Resolution: Done
- Major
- SLO effort
- Not Selected
- To Do
- 0% To Do, 0% In Progress, 100% Done
- M
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Epic Goal
- Create and propagate telemetry to generate SLOs in dashboards for use by Service Delivery and the business
Why is this important?
- Cluster installation success and duration are critical to understand from the business and customer perspective. A few critical, well-thought-out SLOs can indicate that installation is behaving as expected without inspecting many separate parameters, which makes managing and maintaining the application in a managed-services environment scale more effectively. Without them, more manual effort is spent managing, maintaining, monitoring, and investigating potential issues.
Scenarios
- What % of clusters installed successfully over the last X hours/days? (New metric requiring effort)
- What is the p95 (or available buckets) duration of successful installations? (Understood)
- What is the p95 (or available buckets) duration of failed installations? (Understood)
- Of the clusters that retried, what percentage were successful and what percentage continued to fail? (Suhani Mehta is investigating how we can get a value indicating the retry count)
- Provider label for metrics to distinguish GCP from ARO, etc. (Understood)
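The scenario questions above can be sketched as simple aggregations. A minimal illustration, assuming hypothetical in-memory install records rather than the real telemetry pipeline (the `InstallRecord` fields and function names are invented for this sketch, not actual metric names):

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class InstallRecord:
    """Hypothetical per-install record; fields are assumptions for this sketch."""
    succeeded: bool
    duration_seconds: float
    retried: bool

def success_rate(records: List[InstallRecord]) -> float:
    """What % of clusters installed successfully over the window."""
    if not records:
        return 0.0
    return sum(r.succeeded for r in records) / len(records)

def p95_duration(records: List[InstallRecord], succeeded: bool) -> float:
    """p95 duration (seconds) of successful or failed installations."""
    durations = sorted(r.duration_seconds for r in records if r.succeeded == succeeded)
    if not durations:
        return float("nan")
    # Nearest-rank percentile; a bucketed histogram would approximate this.
    idx = max(0, math.ceil(0.95 * len(durations)) - 1)
    return durations[idx]

def retry_success_rate(records: List[InstallRecord]) -> float:
    """Of the clusters that retried, what fraction ultimately succeeded."""
    retried = [r for r in records if r.retried]
    if not retried:
        return 0.0
    return sum(r.succeeded for r in retried) / len(retried)
```

In the real implementation these would be queries over telemetry (e.g. histogram buckets with a provider label) rather than in-memory aggregation, but the definitions of the SLO questions stay the same.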
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- SLOs accepted
- Reports available in Grafana
- Alarms triggered when SLOs are not met
Dependencies (internal and external)
- Work with the installer team to reach consensus on the overall install SLO, and gain acceptance from SD, SD product management, and engineering
- Aggregation in the Observatorium effort could be handled outside engineering, in SD, but that is to be determined
Previous Work (Optional):
- …
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>