Type: Story
Resolution: Done
Priority: Blocker
Fix Version/s: MCE 2.2.0, ACM 2.7.0
Affects Version/s: ACM 2.7.0
Component/s: HyperShift, QE
Labels:

Blocked:
False
Blocked Reason:
None
Ready:
False
Intelligence Requested:
Market:

Regression:
No

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Epic Goal

Define SLIs for components that will be used by Service Delivery.
Instrument the code for the agreed-upon SLIs and expose the resulting metrics.
Define alerting rules for SLIs.
Determine starting SLO based on aggregation of our SLIs.

Why is this important?

Meet SLA requirements that will be established as part of SD.
Service monitoring and alerting will be essential for quick RCA and resolution of service disruptions across environments.

Scenarios

Metric type	PagerDuty	Name	Description	PagerDuty gating condition
Bool (binary)	YES	Addon controller	Returns 0 if the pod is healthy, otherwise 1	Skip the PagerDuty call if Installation / Upgrade is 1
Bool (binary)	YES	Hypershift-operator	Returns 0 if the pod is healthy, otherwise 1	Skip the PagerDuty call if Installation / Upgrade is 1
Bool (binary)	YES	External DNS	Returns 0 if the pod is healthy, otherwise 1	Skip the PagerDuty call if Installation / Upgrade is 1
Count	NO	Restart count (24hrs)	Number of restarts in the last 24 hours	Threshold still to be decided
Bool (binary)	NO	Installation / Upgrade	Returns 1 if an installation or upgrade is in progress, otherwise 0
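
The gating rule in the table above can be sketched as follows. This is only an illustration of the intended logic, not the shipped implementation: the `Signals` class, its field names, and the `components_to_page` function are hypothetical, and in practice the rule would more likely live in a Prometheus alerting rule than in application code. The restart count is carried along but never pages, since the table marks it as non-paging and its threshold is undecided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical snapshot of the SLI values listed in the Scenarios table."""
    addon_controller: int       # 0 = pod healthy, 1 = pod unhealthy
    hypershift_operator: int    # 0 = pod healthy, 1 = pod unhealthy
    external_dns: int           # 0 = pod healthy, 1 = pod unhealthy
    installation: int           # 1 = installation/upgrade in progress, 0 otherwise
    restart_count_24h: int = 0  # informational only; no threshold agreed yet


def components_to_page(signals: Signals) -> list[str]:
    """Return the names of components that should trigger a PagerDuty call."""
    # Gate: skip every PagerDuty call while an installation or upgrade is running.
    if signals.installation == 1:
        return []

    health = {
        "Addon controller": signals.addon_controller,
        "Hypershift-operator": signals.hypershift_operator,
        "External DNS": signals.external_dns,
    }
    # A value of 1 means the component's pod is unhealthy.
    return [name for name, value in health.items() if value == 1]
```

With this sketch, an unhealthy External DNS pod pages only when no installation or upgrade is in progress; during an upgrade the same signal is suppressed.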

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Prometheus

Previous Work (Optional):

Hypershift addon document
- https://docs.google.com/document/d/1wnj49GoBMxez-bnDJ-Qz3WR3bptNssoetyA8NdXeYuw/edit?usp=sharing

Open questions:

Is there a set of SLI signals that Service Delivery requires or suggests?
How many of the signals can be implemented as rules only (no code change required)?

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build{}

clones

ACM-2163 Service monitoring and alerting for ACM

Closed

is related to

ACM-2467 Hypershift total HC metrics not showing values

Closed

Assignee: Roke Jung

Reporter: Joshua Packer

Manager: Juliana Hsu (Inactive)

Technical Lead: Roke Jung

QA Contact: David Huynh

Architect: Joydeep Banerjee

Votes: 0

Watchers: 2

Created: 2022/11/22 3:45 AM

Updated: 2022/12/19 8:47 AM

Resolved: 2022/12/19 8:47 AM
