Epic Goal

Create and propagate telemetry to generate SLOs in dashboards for use by Service Delivery and the business

Why is this important?

Whether clusters install successfully, and how long installation takes, is critical to understand from both the business and the customer perspective. A few critical, well-thought-out SLOs can indicate that installation is behaving as expected without anyone having to look at many different parameters, which lets management and maintenance of the application in a managed services environment scale more effectively. Without them, more manual effort is spent managing, maintaining, monitoring, and investigating potential issues.

What % of clusters installed successfully over the last X hours/days? (New metric requiring effort)
What is the p95 (or available buckets) duration of the successful installations? (Understood)
What is the p95 (or available buckets) duration of the failed installations? (Understood)
Of the clusters that retried, what percentage were successful and what percentage continued to fail? (Suhani Mehta investigating how we can get a value to indicate the count of retries)
Provider label for metrics to distinguish GCP from ARO, etc. (Understood)
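
As a rough illustration of the first three questions, the sketch below computes a success percentage from two counters and estimates a p95 duration from cumulative histogram buckets by linear interpolation within the bucket that contains the target rank (the same general approach Prometheus-style histograms use). The function names, bucket bounds, and counts are hypothetical and not taken from the installer's actual metrics.

```python
import math
from typing import List, Tuple


def success_percentage(succeeded: int, failed: int) -> float:
    """Percentage of installs that succeeded; NaN when nothing was attempted."""
    total = succeeded + failed
    if total == 0:
        return float("nan")
    return 100.0 * succeeded / total


def quantile_from_buckets(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with the last bound being +inf.
    """
    if not buckets or buckets[-1][1] == 0:
        return float("nan")
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since nothing better is known.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return upper
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical install-duration buckets (seconds) for one provider label.
example_buckets = [(600.0, 10), (1200.0, 50), (1800.0, 90), (2400.0, 100), (math.inf, 100)]
p95_seconds = quantile_from_buckets(0.95, example_buckets)
```

Running the same functions once per provider label (for example GCP and ARO) would keep the numbers comparable across providers. The accuracy of the p95 estimate depends on how finely the buckets are spaced around the expected install time.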

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
SLOs accepted
Reports available in Grafana
Alarms will be triggered when SLOs are not met
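
To make the alarm criterion concrete, here is a minimal sketch of how observed values could be compared against SLO targets. The SLO names and targets are placeholders chosen for illustration; the real values depend on what SD and engineering agree on, and real alerting would most likely live in the monitoring stack's own rule language rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InstallSLO:
    """One SLO target. All names and targets used here are hypothetical."""
    name: str
    target: float
    higher_is_better: bool


def is_breached(slo: InstallSLO, observed: float) -> bool:
    """True when the observed value is on the wrong side of the target."""
    if slo.higher_is_better:
        return observed < slo.target
    return observed > slo.target


def breached_slos(slos: List[InstallSLO], observed: Dict[str, float]) -> List[str]:
    """Names of SLOs whose observed value misses the target.

    SLOs with no observed value are skipped rather than treated as breached.
    """
    return [
        slo.name
        for slo in slos
        if slo.name in observed and is_breached(slo, observed[slo.name])
    ]


# Placeholder targets: at least 95% success, p95 duration at most 45 minutes.
example_slos = [
    InstallSLO("install_success_pct", 95.0, higher_is_better=True),
    InstallSLO("install_p95_seconds", 2700.0, higher_is_better=False),
]
```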

Work with the installer team to reach consensus on an overall install SLO, and gain acceptance with SD, product management on SD and engineering
Aggregation in Observatorium could be handled outside engineering, in SD, but this is still to be determined

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>