-
Epic
-
Resolution: Won't Do
-
Critical
-
None
-
None
-
None
-
Data Plane Observability
-
False
-
None
-
False
-
To Do
-
0% To Do, 0% In Progress, 100% Done
-
MGDOBR - Sprint 226
Issue Description:
Control Plane monitoring/alerting for RHOSE is provided through our deployment on app-interface. For the data plane, we are currently running with no monitoring/alerting.
This epic exists to introduce the framework for monitoring and alerting in the RHOSE data plane. The focus of this epic is a "steel-thread" implementation of our data plane monitoring: we should achieve a production-grade, end-to-end path for a few metrics and alerts, rather than broad coverage. Additionally, a key part of this epic is to ensure knowledge transfer across the RHOSE engineering team on how our data plane observability works.
Acceptance Criteria:
- Define a core set of monitoring for the RHOSE data plane for the 30-day release
- Define a core set of alerts for the RHOSE data plane for the 30-day release
- Install the Observability operator into the RHOSE data plane (or piggyback on the RHOC installation)
- Provide configuration for the Observability operator to collect metrics for the core set of monitoring requirements
- Provide configuration for the Observability operator to emit the core set of alerts
- Write an SOP for the team to debug and fix the alerts that we emit
- For our production environment, ensure that alerts are sent to PagerDuty and received by the team member on-call
- Present to the RHOSE engineering team how the data plane observability stack is configured and works
End-to-end definition:
- Observability operator (OO) installed into DP
- RHOSE operator can configure the OO
- We have our first metric to support "ManagedBridges are working" being scraped
- We have one or more alerts that will fire when "ManagedBridges are not working"
- That alert is forwarded to PagerDuty and sent to the person on-call
- The Alert includes a link to an SOP that gives some hints on what to check to make Bridges work again
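
As a rough illustration of the end-to-end definition above, the sketch below builds a Prometheus-style alerting rule group as a plain Python dict and prints it as JSON. It is only a sketch under stated assumptions: the metric name `managed_bridges_ready`, the alert name, the group name, the `sop_url` annotation key, and the SOP URL are all hypothetical placeholders, not names taken from RHOSE or the Observability operator. How the rule is actually delivered to the operator and routed to PagerDuty is not covered here.

```python
import json


def build_bridge_alert_rule(metric, sop_url, for_duration="5m"):
    """Return a Prometheus-style alerting rule group as a plain dict.

    All names used here are hypothetical; the real metric and alert
    names would come from the core set defined in this epic.
    """
    return {
        "name": "rhose-data-plane",  # hypothetical group name
        "rules": [
            {
                "alert": "ManagedBridgesNotWorking",  # hypothetical alert name
                # Fire when the (assumed) readiness gauge reports zero.
                "expr": f"{metric} == 0",
                # Require the condition to hold before the alert fires.
                "for": for_duration,
                "labels": {"severity": "critical"},
                "annotations": {
                    "summary": "ManagedBridges are not working",
                    # Link to the SOP so the person on-call knows what to check.
                    "sop_url": sop_url,
                },
            }
        ],
    }


rule_group = build_bridge_alert_rule(
    metric="managed_bridges_ready",  # assumed metric name
    sop_url="https://example.com/sop/managed-bridges",  # placeholder URL
)
print(json.dumps(rule_group, indent=2))
```

Keeping the SOP link as an annotation on the rule itself means the link travels with the alert to whatever receiver is configured, which matches the requirement that the alert carries its own debugging pointer.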
- is caused by: MGDOBR-1031 Investigate monitoring and alerting strategy for the Data Plane (Closed)