Story
Resolution: Done
Blocker
ACM 2.7.0
Epic Goal
- Define SLIs for components that will be used by Service Delivery.
- Instrument the code for the agreed-upon SLIs and expose the resulting metrics.
- Define alerting rules for SLIs.
- Determine starting SLO based on aggregation of our SLIs.
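One way to read the last goal is sketched below. It is only an illustration: the ticket does not say how the SLIs should be aggregated, so taking the lowest observed availability across components is an assumption, and the function names and sample values are hypothetical.

```python
from typing import Dict, List


def availability(samples: List[int]) -> float:
    """Fraction of samples in which a binary SLI reported healthy (0)."""
    if not samples:
        raise ValueError("no samples to aggregate")
    healthy = sum(1 for s in samples if s == 0)
    return healthy / len(samples)


def starting_slo(sli_samples: Dict[str, List[int]]) -> float:
    """Assumed aggregation: the lowest availability seen across all components."""
    return min(availability(samples) for samples in sli_samples.values())


# Hypothetical samples: 0 = healthy, 1 = unhealthy, one value per scrape.
observed = {
    "addon_controller": [0, 0, 0, 1],
    "hypershift_operator": [0, 0, 0, 0],
    "external_dns": [0, 1, 0, 1],
}
baseline = starting_slo(observed)
```

Using the weakest component as the baseline keeps the starting SLO achievable; a weighted or per-component target would be an equally valid choice.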
Why is this important?
- Meet SLA requirements that will be established as part of SD.
- Service monitoring and alerting are essential for quick RCA and resolution of service disruptions across environments.
Scenarios
Metric type | PagerDuty | Name | Description | Gate to PagerDuty |
Bool (binary) | YES | Addon controller | Returns 0 if the pod is healthy, otherwise 1 | Skip the PagerDuty call if Installation / Upgrade is 1 |
Bool (binary) | YES | Hypershift-operator | Returns 0 if the pod is healthy, otherwise 1 | Skip the PagerDuty call if Installation / Upgrade is 1 |
Bool (binary) | YES | External DNS | Returns 0 if the pod is healthy, otherwise 1 | Skip the PagerDuty call if Installation / Upgrade is 1 |
Count | NO | Restart count (24hrs) | Number of restarts in the last 24 hours | Threshold to be determined |
Bool (binary) | NO | Installation / Upgrade | Returns 1 if an installation or upgrade is in progress, otherwise 0 | N/A |
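The gating in the table can be sketched as below. This is a minimal illustration under assumptions, not the shipped alerting rules: the `Signals` type, its field names, and both function names are hypothetical, and the restart threshold is left out because the ticket has not chosen one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Signals:
    # Binary health metrics from the table: 0 = healthy, 1 = unhealthy.
    addon_controller: int
    hypershift_operator: int
    external_dns: int
    # 1 while an installation or upgrade is in progress, otherwise 0.
    installation: int


def should_page(s: Signals) -> bool:
    """Page only if a component is unhealthy and no install/upgrade is running."""
    if s.installation == 1:
        return False  # gate: skip the PagerDuty call during installation
    return any(
        value == 1
        for value in (s.addon_controller, s.hypershift_operator, s.external_dns)
    )


def restarts_last_24h(restart_times: List[datetime], now: datetime) -> int:
    """Count restarts in the 24 hours up to and including `now`."""
    cutoff = now - timedelta(hours=24)
    return sum(1 for t in restart_times if cutoff < t <= now)
```

For example, `should_page(Signals(1, 0, 0, 1))` evaluates to `False` because the installation gate suppresses the page, while the same signals with `installation=0` would page.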
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
Dependencies (internal and external)
- Prometheus
Previous Work (Optional):
- Hypershift addon document
Open questions:
- Is there a set of SLI signals that Service Delivery requires or suggests?
- How many of the signals can be just rules? (no code change required)
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build