-
Story
-
Resolution: Done
-
Critical
-
None
Value:
For Managed Services deployments of Hypershift clusters, we need to create SLOs/SLIs and alerts for the Hypershift add-on agent because it is in the critical path of hosted cluster creation.
The Hypershift add-on manager metrics are documented here:
https://github.com/stolostron/hypershift-addon-operator/blob/main/docs/advanced/prometheus_metrics.md
If mce_hs_addon_install_in_progress_bool=0 and (mce_hs_addon_hypershift_operator_degraded_bool=1 or mce_hs_addon_ext_dns_operator_degraded_bool=1), the Hypershift operator is not available.
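The availability condition above can be sketched as a small predicate. This is only an illustration of the boolean logic, assuming the three gauge values have already been fetched from the monitoring stack; the function name and parameter names are hypothetical and not part of the add-on.

```python
def hypershift_operator_unavailable(
    install_in_progress: int,
    hypershift_operator_degraded: int,
    ext_dns_operator_degraded: int,
) -> bool:
    """Return True when the Hypershift operator should be treated as not available.

    Arguments are the current values (0 or 1) of:
      mce_hs_addon_install_in_progress_bool
      mce_hs_addon_hypershift_operator_degraded_bool
      mce_hs_addon_ext_dns_operator_degraded_bool
    """
    # A degraded state only counts once no installation is in progress.
    install_finished = install_in_progress == 0
    any_degraded = hypershift_operator_degraded == 1 or ext_dns_operator_degraded == 1
    return install_finished and any_degraded


if __name__ == "__main__":
    # Install finished and the external DNS operator is degraded: not available.
    print(hypershift_operator_unavailable(0, 0, 1))
    # Install still in progress: degraded flags are ignored.
    print(hypershift_operator_unavailable(1, 1, 1))
```

Gating on the install-in-progress flag keeps a transient degraded state during installation from being counted against the SLO.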
From rokejungrh
With the following count metrics, we should compute the rate of failure over 10 minutes and generate an alert if the rate goes over a set threshold:
- mce_hs_addon_placement_score_failure_count
- mce_hs_addon_cluster_claims_failure_count
- mce_hs_addon_hub_sync_failure_count
- mce_hs_addon_kubeconfig_secret_copy_failure_count
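A minimal sketch of the rate-based alert described above, assuming each failure counter is available as a list of (timestamp in seconds, counter value) samples. The threshold value and all function names are hypothetical placeholders; the calculation is a simplified increase-over-window and does not reproduce the extrapolation that Prometheus applies in its own rate function.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)

WINDOW_SECONDS = 600           # the 10-minute window from the description
FAILURE_RATE_THRESHOLD = 0.01  # hypothetical: failures per second


def counter_increase(samples: List[Sample]) -> float:
    """Sum the increases between consecutive samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # If the counter dropped, assume it was reset to zero and counted up to cur.
        increase += cur - prev if cur >= prev else cur
    return increase


def failure_rate(samples: List[Sample], now: float,
                 window: float = WINDOW_SECONDS) -> float:
    """Failures per second over the window ending at `now`."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0  # not enough data to compute an increase
    return counter_increase(in_window) / window


def should_alert(samples: List[Sample], now: float,
                 threshold: float = FAILURE_RATE_THRESHOLD,
                 window: float = WINDOW_SECONDS) -> bool:
    """True when the failure rate over the window exceeds the threshold."""
    return failure_rate(samples, now, window) > threshold


if __name__ == "__main__":
    # Illustrative samples for mce_hs_addon_hub_sync_failure_count (made-up values).
    hub_sync_failures = [(0.0, 0.0), (300.0, 3.0), (600.0, 12.0)]
    print(should_alert(hub_sync_failures, now=600.0))
```

The same check would be applied to each of the four counters, possibly with a different threshold per metric.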
Josh & Roke created the SLO dashboard here:
https://grafana.stage.devshift.net/d/87f7f256a3506f65da8694b290e8d8e4/acm-hypershift-addon[…]-cluster=&var-datasource=hypershift-observatorium-stage
The dashboard currently shows no data because the monitoring stack needed to send metrics to RHOBS has not been set up yet.
Definition of Done for Engineering Story Owner (Checklist)
- ...
Development Complete
- [ ] The code is complete. PR accepted in repo - https://grafana.stage.devshift.net/d/87f7f256a3506f65da8694b290e8d8e4/acm-hypershift-addon[…]-cluster=&var-datasource=hypershift-observatorium-stage
- [ ] The SLO/SLI can be seen using Promlens - https://promlens.stage.devshift.net/