OpenShift Container Platform (OCP) Strategy / OCPSTRAT-1611

Provide proactive metrics around connectivity between control plane and data plane in HCP clusters.


      Feature Overview (aka. Goal Summary)  

      This feature aims to introduce metrics that will enable Cluster Service Providers to proactively monitor and detect any issues in the connectivity between the control plane and data plane within Hosted Clusters. By providing real-time insights into connectivity health, this feature enhances the reliability and stability of HCP deployments, preventing downtime and reducing troubleshooting time.

      Background

      This feature request arises from a need identified following issue https://issues.redhat.com/browse/OCPBUGS-37486, where a lack of proactive detection led to a significant outage during an upgrade. By implementing these metrics, similar incidents can be prevented in the future, improving the overall stability and reliability of Hosted Control Planes (HCP).

      Goals (aka. expected user outcomes)

      • Cluster Service Providers can monitor metrics that indicate the health and status of connections between the control plane and the data plane.
      • Connectivity issues are detected proactively, allowing for quicker resolution.
      • Enhanced visibility into the operational state of both control plane and data plane components, reducing unexpected outages during upgrades or configuration changes.

      Requirements (aka. Acceptance Criteria)

      • Metrics Exposure: Konnectivity pods expose metrics that reflect the connection status from the control plane to the data plane.
      • Data Plane Metrics: Data plane components expose metrics that indicate the status of connections back to the control plane.
      • Error Reporting: The system reports errors or anomalies in connectivity, distinguishing between transient and persistent issues.
      • Scalability: The solution should scale with the number of nodes and clusters, without significant performance degradation.
      • Security: Ensure metrics are exposed securely, with proper authentication and authorization in place.
      • Reliability: The metrics system should be reliable, providing accurate and timely data even under load.
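
      The Error Reporting requirement above could be approached in several ways. The following is a minimal sketch, not a proposed implementation: it assumes connectivity is sampled periodically and treats a failure streak as persistent once it exceeds a configurable duration. The `ProbeSample` type, the `classify_connectivity` function, and the 300-second default are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeSample:
    """One connectivity check result (hypothetical shape, not a real API)."""
    timestamp: float  # seconds since epoch
    connected: bool   # True if the control plane reached the data plane


def classify_connectivity(samples: Sequence[ProbeSample],
                          persistent_after: float = 300.0) -> str:
    """Return "unknown", "healthy", "transient", or "persistent".

    Samples must be ordered by timestamp. The current failure streak is
    persistent if it has lasted at least `persistent_after` seconds
    (an assumed threshold); a shorter streak is transient.
    """
    if not samples:
        return "unknown"
    if samples[-1].connected:
        return "healthy"
    # Walk backwards to find where the current failure streak started.
    streak_start = samples[-1].timestamp
    for sample in reversed(samples):
        if sample.connected:
            break
        streak_start = sample.timestamp
    duration = samples[-1].timestamp - streak_start
    return "persistent" if duration >= persistent_after else "transient"
```

      A time-based threshold is only one option; counting consecutive failed samples would work equally well when the sampling interval is fixed.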

      Deployment considerations

      • Self-managed, managed, or both: Both
      • Classic (standalone cluster): N/A
      • Hosted control planes: Applicable
      • Multi node, Compact (three node), or Single node (SNO), or all: N/A
      • Connected / Restricted Network: Applicable, ensure metrics are accessible in both scenarios.
      • Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): Applicable to all supported architectures.
      • Operator compatibility: Ensure compatibility with relevant operators managing HCP clusters.
      • Backport needed (list applicable versions): Assess need for backporting to relevant versions.
      • UI need (e.g. OpenShift Console, dynamic plugin, OCM): Possibly for ACM, related to https://issues.redhat.com/browse/OCPSTRAT-1088 and https://issues.redhat.com/browse/OBSDA-451
      • Other (please specify): N/A

      Use Cases (Optional)

      1. Proactive Monitoring: Cluster Service Providers monitor metrics during an HCP upgrade to ensure that connectivity between control plane and data plane remains intact.
      2. Troubleshooting: In the event of a reported issue, Cluster Service Providers can quickly identify whether a connectivity problem between control plane and data plane is the root cause, using the exposed metrics.
      3. Capacity Planning: Understanding connectivity trends over time can help in capacity planning and scaling decisions.

      Out of Scope

      • Any enhancements to the control plane or data plane components themselves beyond metrics exposure.
      • UI/UX enhancements outside of displaying the new metrics in existing dashboards.

      Customer Considerations

      • Ensure that metrics are tailored to be useful for customer-specific environments, including both managed and self-managed HCP deployments.
      • Provide guidance on how to integrate these metrics with existing monitoring and alerting systems.

      Documentation Considerations

      • Update Konnectivity documentation to include details on the new metrics.
      • Provide examples of how to interpret these metrics in the context of HCP clusters.
      • Include instructions for configuring alerts based on these metrics.
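
      As one illustration of interpreting such metrics, the sketch below scans a scrape in the Prometheus text exposition format and lists hosted clusters that report zero connections. This feature does not yet define metric names, so the metric `hcp_konnectivity_agent_connections` and the label `hosted_cluster` are hypothetical placeholders, and the parser handles only simple sample lines.

```python
import re
from typing import List

# One sample line: metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces, allowing escaped characters in the value.
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def disconnected_clusters(exposition: str,
                          metric: str = "hcp_konnectivity_agent_connections",
                          label: str = "hosted_cluster") -> List[str]:
    """Return sorted label values whose sample for `metric` equals zero.

    The default metric and label names are hypothetical placeholders.
    """
    found = []
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        if label in labels and float(match.group("value")) == 0:
            found.append(labels[label])
    return sorted(found)
```

      In practice the same condition would more likely be written as an alerting rule in the monitoring stack; the function only shows what such a rule would evaluate.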

      Interoperability Considerations

      • Ensure compatibility with self-managed, ROSA, and ARO clusters where Hosted Control Planes are deployed.
      • Consider potential impacts on existing monitoring tools and integrate accordingly.

       

            Adel Zaalouk (azaalouk)
            Matthew Werner