  1. Red Hat OpenBridge
  2. MGDOBR-1098

Data Plane Observability


    • To Do
    • 0% To Do, 0% In Progress, 100% Done
    • MGDOBR - Sprint 226

      Issue Description:

      Control Plane monitoring/alerting for RHOSE is provided through our deployment on app-interface. For the data plane, we are currently running with no monitoring/alerting.

      This epic exists to introduce the framework for monitoring and alerting in the RHOSE data plane. The focus of this epic is a "steel-thread" implementation of our data plane monitoring. We should focus on achieving a production-grade, end-to-end implementation for a few metrics and alerts, rather than broad coverage. Additionally, a key part of this epic is to ensure knowledge transfer across the RHOSE engineering team on how our data plane observability works.

      Acceptance Criteria:

      • Define a core set of monitoring for the RHOSE data plane for the 30-day release
      • Define a core set of alerts for the RHOSE data plane for the 30-day release
      • Install the Observability operator into the RHOSE data plane (or piggyback on the RHOC installation)
      • Provide configuration for the Observability operator to collect metrics for the core set of monitoring requirements
      • Provide configuration for the Observability operator to emit the core set of alerts
      • Write an SOP for the team to debug and fix the alerts that we emit
      • For our production environment, ensure that alerts are sent to PagerDuty and received by the team member on-call
      • Present to the RHOSE engineering team how the data plane observability stack is configured and works
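
      The PagerDuty criterion above could be sketched as follows. This is only an illustration, assuming the Observability operator ultimately drives an Alertmanager-style routing configuration; the function name, receiver names, environment names, and the choice to page only on critical alerts are all hypothetical and not taken from this epic.

```python
def alertmanager_config(environment: str, pagerduty_routing_key: str) -> dict:
    """Build an Alertmanager-style config dict (hypothetical sketch).

    Assumption: only the production environment forwards alerts to
    PagerDuty; every other environment keeps a receiver that forwards nothing.
    """
    # A receiver with no integrations: alerts routed here are not forwarded.
    receivers = [{"name": "default"}]
    routes = []

    if environment == "production":
        receivers.append({
            "name": "pagerduty-oncall",  # hypothetical receiver name
            "pagerduty_configs": [{"routing_key": pagerduty_routing_key}],
        })
        routes.append({
            # Assumption: only critical alerts page the on-call engineer.
            "matchers": ['severity="critical"'],
            "receiver": "pagerduty-oncall",
        })

    return {
        "route": {"receiver": "default", "routes": routes},
        "receivers": receivers,
    }
```

      In this sketch a non-production environment ends up with no PagerDuty receiver at all, so a misrouted alert there cannot page anyone.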

      End-to-end definition:

      1. Observability operator (OO) installed into DP
      2. RHOSE operator can configure the OO
      3. We have our first metric to support "ManagedBridges are working" being scraped
      4. We have one or more alerts that will fire when "ManagedBridges are not working"
      5. That alert is forwarded to PagerDuty and sent to the person on-call
      6. The alert includes a link to an SOP that gives some hints on what to check to make Bridges work again
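
      Steps 3 to 6 could be sketched as a single alert definition plus a small check that it carries what the on-call engineer needs. This is a minimal illustration under stated assumptions: the metric name `rhose_managed_bridges_ready`, the alert name, the `5m` duration, and the `sop_url` annotation key are hypothetical placeholders. Only the field layout (`alert`, `expr`, `for`, `labels`, `annotations`) follows the usual Prometheus alerting rule shape.

```python
REQUIRED_ANNOTATIONS = ("summary", "sop_url")  # hypothetical team convention


def managed_bridges_alert(sop_url: str) -> dict:
    """Return a Prometheus-style alerting rule as a dict (hypothetical names)."""
    return {
        "alert": "ManagedBridgesNotWorking",          # hypothetical alert name
        "expr": "rhose_managed_bridges_ready == 0",   # hypothetical metric
        "for": "5m",                                  # assumed duration
        "labels": {"severity": "critical"},
        "annotations": {
            "summary": "ManagedBridges are not working",
            "sop_url": sop_url,  # link the on-call engineer follows (step 6)
        },
    }


def missing_fields(rule: dict) -> list:
    """List the fields a rule lacks before it is fit to page someone."""
    missing = [key for key in ("alert", "expr") if not rule.get(key)]
    if not rule.get("labels", {}).get("severity"):
        missing.append("labels.severity")
    for key in REQUIRED_ANNOTATIONS:
        if not rule.get("annotations", {}).get(key):
            missing.append("annotations." + key)
    return missing
```

      A check like `missing_fields` could run in CI so that no alert reaches PagerDuty without an SOP link attached.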


            manstis@redhat.com Michael Anstis
