  1. Observability and Data Analysis Program
  2. OBSDA-451

Incorporate a clear monitoring story with self-managed Hosted Control Planes


      Overview

      This feature provides a monitoring story for customers running self-managed Hosted Control Planes (ACM/MCE with HCP). As the MVP for cases where ACM is not in use, it reuses the pluggable dashboard feature of the OCP console. The goal is better observability and an improved user experience. A dashboard of this kind can be configured with a ConfigMap such as the following:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        labels:
          # Label that marks this ConfigMap as a console dashboard
          console.openshift.io/dashboard: "true"
        name: basic-hcp-dashboard
        namespace: hypershift
      data: ...

      Key Considerations

      • Dashboard creation is initiated when the customer opts in to all metrics (not just telemetry). By default, not all metrics are exported, to avoid overloading the monitoring stack.
      • The dashboard will track key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) such as API availability, API server error rates, usage of the rest of the control plane and, in the future, latency between the control plane and workers. We will start with the three metrics that are easiest to implement.
      • Additionally, alerts should be exposed to highlight symptoms.
      • We aim to provide a pragmatic, if not aesthetically perfect, monitoring user experience without muddling our ACM messaging. The north star remains the ACM observability stack as a sustainable, comprehensive monitoring solution.
      • Dashboard configuration is per HCP, with each HCP living in its own OpenShift project (namespace). This is compatible with the tenancy model of User Workload Monitoring (UWM).
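As a rough illustration of the alerting consideration, a symptom-based alert on API server error rate could be expressed as a PrometheusRule placed in the HCP's own namespace, which fits the per-namespace tenancy of UWM. This is only a sketch: the rule name, the alert name, the namespace, the 5% threshold, and the 10m duration are all hypothetical, and it assumes the hosted API server's `apiserver_request_total` metric is actually exported for that namespace once the customer has opted in to all metrics.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hcp-api-error-rate            # hypothetical rule name
  namespace: clusters-example-hcp     # hypothetical per-HCP namespace
spec:
  groups:
    - name: hcp-control-plane.rules   # hypothetical group name
      rules:
        - alert: HostedAPIServerHighErrorRate   # hypothetical alert name
          # Ratio of 5xx responses to all API server requests over 5 minutes.
          # Assumes apiserver_request_total is available in this namespace.
          expr: |
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
              /
            sum(rate(apiserver_request_total[5m])) > 0.05
          for: 10m                    # illustrative; tune to the agreed SLO
          labels:
            severity: warning
          annotations:
            summary: More than 5% of hosted API server requests are failing with 5xx errors.
```

The same ratio expression could back an error-rate panel on the dashboard, so the alert and the SLI share one definition.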

      Open Discussion / Long-term Concerns 

      The usage of UWM for HCP metrics on the management cluster has a few drawbacks:

      • Configuration via ConfigMap is more error-prone and less GitOps friendly.
      • There are fewer configuration knobs than are available out of the box with the Observability Operator (ObO), and delivery is slower because it is bound to the OCP release cadence.

      These issues would be resolved by using ObO, which is currently being productized.
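For context on the ConfigMap drawback, the sketch below shows how UWM is typically enabled and tuned on an OpenShift cluster. The actual settings live in a YAML document embedded as a string under `config.yaml`, so a typo or indentation mistake inside that string is not caught by the API server's schema validation. The `24h` retention value is only an illustrative assumption, not a recommendation for HCP management clusters.

```yaml
# Enables monitoring for user-defined projects (UWM) on the cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
---
# Tunes the UWM Prometheus instance; the settings are an embedded string.
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h   # illustrative value only
```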

      Acceptance Criteria

      1. Custom dashboards are introduced via the OCP console dashboard plugin feature.
      2. The dashboard provides monitoring and tracking for the agreed-upon SLIs/SLOs.
      3. The dashboard configuration is per HCP, aligning with the tenancy model of UWM.
      4. Alerts are exposed highlighting symptoms, potentially following the runbooks: https://github.com/openshift/runbooks/tree/master/alerts
      5. Communication and cooperation with the rest of the team is successful, so that no details are missed and the right story is communicated to the customer in our documentation.

            rh-ee-rfloren Roger Florén
            azaalouk Adel Zaalouk
            Cesar Wong, Daniel Mohr, Derek Carr, Eric Paris, Jan Fajerski, Roger Florén
            Laura Hinson Laura Hinson
            Cesar Wong Cesar Wong
            Adel Zaalouk Adel Zaalouk