OBSDA-451 (Observability and Data Analysis Program)

Incorporate a clear monitoring story with self-managed Hosted Control Planes



      Feature Overview (aka. Goal Summary)

      This feature aims to improve observability and the user experience for customers of self-managed Hosted Control Planes (HCP) using ACM/MCE by leveraging the existing observability feature stack (e.g., the pluggable dashboard console feature in the OCP console as the MVP when ACM is not in use). This approach improves monitoring capabilities and aligns with the tenancy model of User Workload Monitoring (UWM). It also strongly encourages an upsell from MCE to ACM to access those features, while providing a validated best-practice pattern for customers who prefer to build it on their own (with considerably more effort than with ACM).

      Goals (aka. expected user outcomes)

      Users, particularly SRE teams (the cluster service provider persona), will gain enhanced visibility into the health and performance of their HCPs through a customizable monitoring dashboard. This dashboard will provide critical metrics and alerts, aiding in proactive management and troubleshooting. Existing observability features in ACM will be expanded to include these capabilities.

      Requirements (aka. Acceptance Criteria)

      • Introduction of custom dashboards via the OCP console dashboard plugin feature.
      • Monitoring and tracking for agreed-upon SLIs/SLOs.
      • Dashboard configuration per HCP, aligning with the UWM tenancy model.
      • Alerts are exposed to highlight symptoms, potentially following predefined runbooks.
      • Enhanced visibility into HCP health and performance (API server, control plane).
      • Unified observability dashboard within ACM for centralized monitoring.
      • Clear reporting of key signals for SRE teams.
      • Actionable alerts based on monitored signals.
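
      As an illustration of the alerting requirements above, the following is a minimal sketch of a symptom-based alert scoped to a single HCP namespace. It assumes the hosted kube-apiserver metrics (apiserver_request_total) are scraped on the management cluster with a namespace label matching the HCP namespace. The rule name, alert name, threshold, namespace, and runbook URL are hypothetical placeholders, not agreed values.

```python
import json


def build_error_rate_alert(
    hcp_namespace: str,
    threshold: float = 0.05,
    runbook_url: str = "https://example.com/runbooks/hcp-api-errors",  # hypothetical
) -> dict:
    """Return a PrometheusRule manifest (as a dict) for one HCP namespace."""
    selector = f'namespace="{hcp_namespace}"'
    # Ratio of 5xx responses to all API server responses over 5 minutes.
    expr = (
        f'sum(rate(apiserver_request_total{{{selector},code=~"5.."}}[5m]))'
        f" / sum(rate(apiserver_request_total{{{selector}}}[5m])) > {threshold}"
    )
    rule = {
        "alert": "HostedControlPlaneAPIErrorRateHigh",  # hypothetical name
        "expr": expr,
        "for": "10m",
        "labels": {"severity": "warning"},
        "annotations": {
            "summary": f"High API server error rate in {hcp_namespace}",
            "runbook_url": runbook_url,
        },
    }
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        # Living in the HCP namespace keeps the rule inside the UWM tenancy model.
        "metadata": {"name": "hcp-api-error-rate", "namespace": hcp_namespace},
        "spec": {"groups": [{"name": "hcp-symptoms", "rules": [rule]}]},
    }


if __name__ == "__main__":
    # "clusters-demo" is a made-up HCP namespace used only for illustration.
    print(json.dumps(build_error_rate_alert("clusters-demo"), indent=2))
```

      Generating one rule per HCP namespace keeps alerts tied to the tenant that owns them, and the runbook_url annotation gives SRE teams a direct pointer from the symptom to the remediation steps once runbooks are defined.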

      Key Considerations

      • Dashboard creation is to be initiated when the customer opts in for all metrics (not just telemetry). By default, not all metrics are exported to avoid overloading the monitoring stack. 
      • The dashboard will track key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) such as API availability, API server error rates, usage for the rest of the control plane and, in the future, latency between the control plane and workers. We will start with the three metrics that are easiest to implement.
      • Additionally, alerts should be exposed to highlight symptoms.
      • We aim to provide a pragmatic, if not aesthetically perfect, user experience from a monitoring standpoint without muddling our ACM messaging. The north star here is the ACM observability stack as a sustainable, comprehensive monitoring solution.
      • Dashboard configuration is per HCP, with each HCP living in its own OpenShift project (namespace). This is compatible with the tenancy model of User Workload Monitoring (UWM).
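
      To make the SLI/SLO tracking above concrete, here is a minimal sketch of how an API availability SLI and its remaining error budget could be derived from request counts. The function name, the report fields, and the 99.9% default target are assumptions for illustration only; the actual SLOs are still to be agreed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float            # fraction of requests that succeeded
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = exhausted, < 0 = overspent
    slo_met: bool


def evaluate_availability(
    total_requests: int,
    failed_requests: int,
    slo_target: float = 0.999,  # hypothetical target, not an agreed SLO
) -> SloReport:
    """Compute an availability SLI and compare it with an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    availability = 1 - failed_requests / total_requests
    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    remaining = 1 - failed_requests / allowed_failures
    return SloReport(availability, remaining, availability >= slo_target)


if __name__ == "__main__":
    report = evaluate_availability(total_requests=1_000_000, failed_requests=250)
    print(report)
```

      With the sample numbers, 250 failures out of 1,000,000 requests consume about a quarter of a 99.9% error budget, so the SLO is still met. A per-HCP dashboard could show the same two values side by side for each hosted control plane.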

      Deployment Considerations

      Deployment considerations: List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both: Self-managed (but reusable in managed with xCM)
      Classic (standalone cluster): N/A
      Hosted control planes: Applicable
      Multi node, Compact (three node), or Single node (SNO), or all: N/A
      Connected / Restricted Network: Applicable
      Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): Applicable
      Operator compatibility: Observability Operator (ObO)
      Backport needed (list applicable versions): N/A
      UI need (e.g. OpenShift Console, dynamic plugin, OCM): OpenShift Console, dynamic plugin
      Other (please specify): N/A

      Use Cases 

      • Monitoring API availability and error rates in a self-managed HCP.
      • Alerting SRE teams (cluster service providers) about critical performance issues in real-time.
      • Unified monitoring across multiple clusters via ACM, including feature parity for HCP.

      Open Discussion / Long-term Concerns 

      The usage of UWM for HCP metrics on the management cluster has a few drawbacks:

      • Configuration via ConfigMap is more error-prone and less GitOps-friendly.
      • Fewer configuration knobs are available than out of the box with the Observability Operator (ObO), and the delivery model is slower because it is bound to the OCP release cadence.

      These issues would be resolved by using ObO, which is currently being productized.

      Other questions to answer:

      • How will the dashboard handle large volumes of metrics without overloading the monitoring stack?
      • What specific runbooks will be referenced for alerting?
      • How will the configuration be managed to ensure GitOps compatibility?

      Background

      This feature should leverage existing functionality when possible to align with other OCP observability efforts (e.g., pluggable dashboard console feature in the OCP console) to provide enhanced observability for HCP users. It should align with the existing UWM tenancy model and address immediate monitoring needs while considering future improvements via the Observability Operator.

      Customer Considerations

      Customers opting for full metrics export must be aware of the potential impact on the monitoring stack. Clear documentation and guidelines will be provided to manage configuration and alerts effectively.

      Documentation Considerations

      Documentation will include setup guides, configuration examples, and troubleshooting tips. It will also link to existing ACM observability documentation for comprehensive coverage.

              rh-ee-rfloren Roger Florén
              azaalouk Adel Zaalouk
              Cesar Wong, Daniel Mohr, Derek Carr, Eric Paris, Jan Fajerski, Roger Florén
              Laura Hinson Laura Hinson
              Cesar Wong Cesar Wong
              Adel Zaalouk Adel Zaalouk