Type: Story
Resolution: Done
Priority: Critical
Fix Version/s: ACM 2.8.0
Affects Version/s: None
Component/s: Observability
Labels:

Story Points:
5
Blocked:
False
Ready:
False
Epic Link:
SD SLO/SLI/Alert for ACM
Acceptance Criteria:
- SLO/SLI/Alerts can be seen using Promlens: https://promlens.stage.devshift.net/ by using PromQL
- Simulate a failure condition (what should be failed will depend on the SLO/SLI definition) and make sure that it gets reflected in the SLO/SLI value accurately.
- Simulate failure conditions and make sure alerts pick them up.

Intelligence Requested:
Market:

Sprint:
Observability Sprint 2023-04

Regression:
No

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Value:

For Managed Services deployment of Hypershift clusters, we need to create SLO/SLI and alerts for the Hypershift addon agent because it is in the critical path of hosted cluster creation.

The Hypershift add-on manager metrics are documented here:
https://github.com/stolostron/hypershift-addon-operator/blob/main/docs/advanced/prometheus_metrics.md

If mce_hs_addon_install_in_progress_bool=0 and (mce_hs_addon_hypershift_operator_degraded_bool=1 or mce_hs_addon_ext_dns_operator_degraded_bool=1), the Hypershift operator is not available.
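
To make the condition above concrete, here is a minimal Python sketch of the same boolean logic. The function name and its parameters are hypothetical and only mirror the three gauge metrics named above; the real check would be expressed in PromQL against the actual series.

```python
def hypershift_operator_unavailable(
    install_in_progress: int,
    hypershift_operator_degraded: int,
    ext_dns_operator_degraded: int,
) -> bool:
    """Return True when the 'operator is not available' condition holds.

    Each argument is the current 0/1 value of the matching gauge:
      install_in_progress          -> mce_hs_addon_install_in_progress_bool
      hypershift_operator_degraded -> mce_hs_addon_hypershift_operator_degraded_bool
      ext_dns_operator_degraded    -> mce_hs_addon_ext_dns_operator_degraded_bool
    """
    # Degradation only counts once no install is in progress.
    return install_in_progress == 0 and (
        hypershift_operator_degraded == 1 or ext_dns_operator_degraded == 1
    )
```

Gating on the install-in-progress gauge means a degraded reading during an install or upgrade is not treated as unavailability.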

From rokejungrh:
With the following count metrics, we should compute the failure rate over 10 minutes and generate an alert if the rate goes over a set threshold:
* mce_hs_addon_placement_score_failure_count
* mce_hs_addon_cluster_claims_failure_count
* mce_hs_addon_hub_sync_failure_count
* mce_hs_addon_kubeconfig_secret_copy_failure_count
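
The idea of alerting on failures within a 10-minute window can be sketched in Python as follows. This is an illustration only: the helper names, the sample format, and the threshold value are assumptions (the ticket does not state a threshold), and the sketch assumes the counter never resets inside the window. The real alert would be a PromQL rule over the counters listed above.

```python
from typing import Sequence, Tuple

WINDOW_SECONDS = 600  # the 10-minute window mentioned in the ticket

# A sample is (unix_timestamp_seconds, counter_value); hypothetical format.
Sample = Tuple[float, float]


def failures_in_window(
    samples: Sequence[Sample], now: float, window: float = WINDOW_SECONDS
) -> float:
    """Increase of a failure counter over the last `window` seconds.

    `samples` must be sorted by timestamp. Assumes no counter reset
    occurs inside the window.
    """
    values = [value for ts, value in samples if now - window <= ts <= now]
    if len(values) < 2:
        return 0.0  # not enough data points to measure an increase
    return values[-1] - values[0]


def should_alert(samples: Sequence[Sample], now: float, threshold: float) -> bool:
    """Fire when the windowed increase is strictly above `threshold`."""
    return failures_in_window(samples, now) > threshold
```

For example, with samples `[(0, 0), (300, 2), (600, 5), (900, 9)]` and `now=900`, the window covers the last three samples, so the increase is 7 and a hypothetical threshold of 5 would trigger the alert.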

Josh & Roke created the SLO dashboard here:
https://grafana.stage.devshift.net/d/87f7f256a3506f65da8694b290e8d8e4/acm-hypershift-addon[…]-cluster=&var-datasource=hypershift-observatorium-stage

The dashboard has no data yet, pending the monitoring stack setup needed to send metrics to RHOBS.

Definition of Done for Engineering Story Owner (Checklist)

Development Complete

[ ] The code is complete. PR accepted in repo - https://grafana.stage.devshift.net/d/87f7f256a3506f65da8694b290e8d8e4/acm-hypershift-addon[…]-cluster=&var-datasource=hypershift-observatorium-stage
[ ] The SLO/SLI can be seen using Promlens - https://promlens.stage.devshift.net/

is cloned by

ACM-2817 Alerts for ServerFoundation addon

Closed

relates to

ACM-3292 Hypershift addon SOP

Assignee: Disaiah Bennett

Reporter: Joydeep Banerjee

QA Contact: Xiang Yin

Votes: 0

Watchers: 5

Created: 2023/01/11 5:14 AM

Updated: 2023/04/20 2:43 PM

Resolved: 2023/04/20 2:40 PM
