ACM-12063: Multi-signal Observability Storage, Collection and Query Support in ACM

      Feature Overview

      This initiative, internally referred to as Project Nexus, aims to create a single product for multi-cluster observability needs. It focuses on simplifying and unifying the installation of observability data collection across an RHACM-managed fleet of OpenShift Container Platform (OCP) clusters. The Multi-Cluster Observability Addon (MCOA) will manage, reconcile, and enrich Custom Resource Definitions (CRDs), allowing users to configure established in-cluster collectors (such as the OpenTelemetry Collector, ClusterLogForwarder, and the Network Observability Operator) alongside a scalable multi-cluster storage stack (Thanos for metrics and network flows, Loki for logs and network flows, and Tempo for traces).
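
      As a purely illustrative sketch of that flow, the following Go snippet shows one way a hub-side addon could render a per-cluster ClusterLogForwarder that ships platform logs to a hub Loki endpoint. The API group/version, resource names, label key, URL, and spec layout are assumptions for illustration, not the MCOA implementation.

```go
// Hypothetical sketch only: one way a hub-side addon could render a
// per-cluster ClusterLogForwarder manifest. Group/version, names, and the
// spec layout are assumptions for illustration.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// renderLogForwarder builds an unstructured ClusterLogForwarder that sends
// infrastructure logs to a hub Loki endpoint. A real addon would deliver the
// manifest to the managed cluster (for example, via a ManifestWork).
func renderLogForwarder(clusterName, hubLokiURL string) *unstructured.Unstructured {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "logging.openshift.io",
		Version: "v1",
		Kind:    "ClusterLogForwarder",
	})
	u.SetName("instance")
	u.SetNamespace("openshift-logging")
	u.SetLabels(map[string]string{
		// Illustrative label recording which managed cluster this render targets.
		"cluster.open-cluster-management.io/cluster": clusterName,
	})
	u.Object["spec"] = map[string]interface{}{
		"outputs": []interface{}{
			map[string]interface{}{
				"name": "hub-loki",
				"type": "loki",
				"url":  hubLokiURL, // hypothetical hub storage endpoint
			},
		},
		"pipelines": []interface{}{
			map[string]interface{}{
				"name":       "platform-logs",
				"inputRefs":  []interface{}{"infrastructure"},
				"outputRefs": []interface{}{"hub-loki"},
			},
		},
	}
	return u
}

func main() {
	clf := renderLogForwarder("spoke-1", "https://loki.hub.example.com")
	fmt.Println(clf.GetName(), clf.GroupVersionKind().String())
}
```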

      Goals

      The primary goal is to simplify and unify the installation of observability data collection on an RHACM-managed fleet of OCP clusters. This involves extending Multi-Cluster Observability (MCO) to become a comprehensive collection and storage add-on for ACM.

      Expected user outcomes include:

      • For Fleet Administrators:
        • Ability to set up an RH-managed collection of platform observability data to store all logs, metrics, and network flows on an external managed storage and visualization provider.
        • Ability to establish an RH-managed end-to-end observability platform to reliably collect, store, and visualize all platform logs, metrics, and network flows within RHACM.
        • Ability to configure ingestion and storage of observability data from edge-deployed components.
      • For Platform SREs:
        • Ability to configure end-to-end collection of platform observability data for troubleshooting platform issues.
        • Ability to configure storage of platform observability data to optimize for cost or performance.
        • Ability to assess the health of the platform observability storage and collection components via service level indicators, in order to intervene and scale as needed.
        • Ability to create logging and metrics-based alerts for the platform to better handle incidents.
      • For Application SREs/Developers:
        • Ability to configure end-to-end collection of user workload observability data for debugging, optimization, and management.
        • Ability to configure storage of user workload observability data to optimize for cost.
        • Ability to configure auto-instrumentation of workloads.
        • Ability to view observability data for their applications in accessible namespaces, including for edge-deployed components.

      Requirements

      Requirement | Notes | isMvp?
      CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
      Release Technical Enablement | Provide necessary release enablement details and documents. | YES
      Extend MCO CRD | The MultiClusterObservability CRD will be extended with a new optional field capabilities to define desired observability data collection and storage methods for platform and user workloads. | YES
      Support for Platform Observability | Encompass all observability data from components required to operate each managed cluster and RHACM agents, with end-to-end support from collection to storage and OCP/RHACM-based visualization. | YES
      Support for User Workload Observability | Include all observability data produced by workloads maintained by Platform SRE, Application SRE, and Application Developers, with end-to-end support from collection to storage. | YES
      MCOA as a pluggable addon for OCM | Automate the collection and forwarding of Observability Signals using component CRDs for each Observability Collector/Storage medium. | YES
      Secure By Design | Implement comprehensive authentication and authorization (authN/authZ) mechanisms for secure, multi-tenant visibility at the per-signal level. | YES
      Blueprint-based Design | Clearly defined APIs for each signal (metrics, logs, etc.) will serve as customizable blueprints, enabling consistent configuration across the entire fleet. | YES
      Streamlined Provisioning of existing observability solutions | Prioritize simple tools to provision existing observability solutions fleetwide, ensuring consistency and speed of deployment. | YES
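
      The capabilities field named in the "Extend MCO CRD" requirement is still being designed; the Go type sketch below is a hypothetical illustration (kubebuilder-style API types) of how optional, per-signal collection and storage toggles might be modeled for platform and user workloads. All type and field names are assumptions, not the final API.

```go
// Hypothetical sketch only: possible shape of an optional "capabilities"
// field on the MultiClusterObservability CRD. Type and field names are
// assumptions, not the final API.
package v1beta2

// CapabilitiesSpec groups per-signal collection and storage toggles for the
// platform and for user workloads.
type CapabilitiesSpec struct {
	// Platform covers observability data from the components required to
	// operate each managed cluster and the RHACM agents.
	// +optional
	Platform *PlatformCapabilitiesSpec `json:"platform,omitempty"`

	// UserWorkloads covers observability data produced by workloads
	// maintained by Platform SREs, Application SREs, and Application Developers.
	// +optional
	UserWorkloads *UserWorkloadCapabilitiesSpec `json:"userWorkloads,omitempty"`
}

// PlatformCapabilitiesSpec enables individual platform signals.
type PlatformCapabilitiesSpec struct {
	// +optional
	Logs *SignalSpec `json:"logs,omitempty"`
	// +optional
	Metrics *SignalSpec `json:"metrics,omitempty"`
	// +optional
	Traces *SignalSpec `json:"traces,omitempty"`
}

// UserWorkloadCapabilitiesSpec enables individual user-workload signals.
type UserWorkloadCapabilitiesSpec struct {
	// +optional
	Logs *SignalSpec `json:"logs,omitempty"`
	// +optional
	Traces *SignalSpec `json:"traces,omitempty"`
}

// SignalSpec is a minimal per-signal toggle; when a signal is enabled, the
// addon would reconcile the corresponding component CRD (for example,
// ClusterLogForwarder or an OpenTelemetry Collector) on each managed cluster.
type SignalSpec struct {
	Enabled bool `json:"enabled"`
}
```

      Under a sketch like this, a fleet administrator would set capabilities once on the hub's MultiClusterObservability resource, and MCOA would reconcile the matching component CRDs across the fleet.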

      (Optional) Use Cases

      Main success scenarios - high-level user stories:

      • As a Fleet Administrator, I want to set up an RH-managed collection of platform observability data so that I can store all logs, metrics, and network flows on an external managed storage and visualization provider.
      • As a Fleet Administrator, I want to set up an RH-managed end-to-end observability platform so that I can reliably collect, store, and visualize all platform logs, metrics, and network flows within RHACM.
      • As a Platform SRE, I want to configure end-to-end collection of platform observability data so that I can troubleshoot platform issues.
      • As an Application SRE, I want to configure end-to-end collection of user workload observability data so that I can debug, optimize, and manage applications.
      • As an Application Developer, I want to view observability data for applications that I build in namespaces that I have access to.

      Alternate flow/scenarios - high-level user stories:

      • As a Fleet Administrator for Edge devices, I want to be able to configure ingestion and storage of observability data from edge-deployed components.
      • As a Platform SRE, I want to configure storage of platform observability data so that I can optimize for cost or performance.
      • As an Application Developer on the edge, I want to view observability data for my edge-deployed components.

      Questions to answer

      • How will Operator version subscriptions and OCP compatibility be managed at a fleetwide scale?
      • How will migration from the existing stack to the MCOA-managed stack be handled? (A migration DDR will address this in the future.)

      Out of Scope

      • The documents do not explicitly list items as "Out of Scope" for the entire feature but mention that some items were deprioritized for specific releases due to resource constraints (e.g., MCOA Hub Trace Storage for MCO 2.13 - ACM-12479).
      • Initially, the focus for user workloads is on collection through storage; visualization may be hosted alongside platform data but is not necessarily a direct part of the initial user workload support.

      Background and strategic fit

      This feature aims to address the challenges users face with observability in increasingly complex Kubernetes environments, including hybrid cloud footprints, microservice architectures, and multi-cluster tenancy. Traditional solutions like Observability-as-a-Service can be inflexible and costly, while manual tooling leads to high operational overhead. Project Nexus intends to provide a unified, easy-to-manage observability solution tailored for multi-cluster environments, leveraging the foundations within ACM MCO (Add-on Framework) and existing OpenShift Observability APIs. This allows for strategic platform extension without starting from scratch.

      The strategic fit includes:

      • Simplified Multi-Cluster Management: Leveraging ACM's existing cluster control plane simplifies deployment and setup compared to standalone tools.
      • Value Added to Existing Platform: Enhances the value of ACM for OpenShift Platform Plus (OPP) customers.
      • Focus and Consistency: Provides a cohesive way to achieve fleet-wide observability within the Red Hat portfolio.
      • Extensible Shared Platform: MCOA’s design allows other teams to integrate additional signals (e.g., network flows, power metrics).
      • Addressing Market Needs: There's a large existing customer base using ACM, and users want to observe the clusters they are managing, find problems quickly, and have a cohesive end-to-end experience.

      Assumptions:

      • Users are leveraging ACM for managing their OCP clusters.
      • There is a need for a centralized and simplified way to manage observability across multiple clusters.
      • Open source projects like Thanos, Loki, Tempo, and OpenTelemetry are suitable and battle-tested for building this observability platform.

      Customer Considerations

      • Customers operate across a wide spectrum of cloud maturity and manage complex Kubernetes environments.
      • They often have hybrid cloud footprints and microservice architectures.
      • Multi-cluster tenancy requires fine-grained access controls.
      • Concerns exist regarding the inflexibility, cost, and data sovereignty of Observability-as-a-Service solutions.
      • The feature aims to provide signal flexibility, allowing users to use sensible defaults for day-0 operations but also customize for day-1+ needs, including storing signals locally, sending to a central hub, or integrating with third-party systems.
      • Cost efficiency is a consideration, leveraging existing single-cluster solutions where possible.

      Documentation Considerations

      • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc.)?
        • Documentation will be needed for Fleet Administrators, Platform SREs, and Application SREs/Developers. This includes setup guides, configuration details for various collectors and storage options, troubleshooting guides, and information on how to use the unified visualizations.
        • Security officers might need information on the authN/authZ mechanisms and data handling.
      • Does this feature have a doc impact?
        • Yes, there will be a doc impact. The "ACM-DDR-025: Multi-cluster Observability Addon" document states "None until tech preview, however, if we are taking doc from other teams, we need a conversation with Josh and doc manager about how that will look." Given the scope, new content and updates to existing content will be required.
      • New Content, Updates to existing content, Release Note, or No Doc Impact
        • New Content: For the MCOA, new capabilities, CRD configurations, user workflows, and UI features.
        • Updates to existing content: MCO documentation will need to be updated to reflect these new capabilities and the integration of MCOA.
        • Release Notes: Will be required for each release detailing new features, changes, and bug fixes.
      • What concepts do customers need to understand to be successful in using this feature?
        • Customers need to understand core observability concepts (metrics, logs, traces, network flows).
        • Familiarity with Kubernetes CRDs and OpenShift Operators.
        • Understanding of RHACM and its role in multi-cluster management.
        • Concepts related to the specific observability tools being integrated (Thanos, Loki, Tempo, OpenTelemetry).
      • How do we expect customers will use the feature? For what purpose(s)?
        • Customers are expected to use the feature to gain comprehensive observability across their entire fleet of OCP clusters managed by RHACM.
        • Purposes include platform monitoring, troubleshooting, incident response, application performance monitoring, cost optimization for observability data storage, and ensuring the health and performance of their applications and infrastructure.
      • What reference material might a customer want or need to use this feature?
        • Included below.

      Core Design Documents & Explainers:

      Roadmaps & Repositories:

      Other Relevant Design Documents Referenced:
