OCPSTRAT-1852

[Observability] Improve control plane metric reporting in hosted cluster monitoring stack


    • Product / Portfolio Work
    • OCPSTRAT-1853 Enhanced Visibility into Control Plane and Data Plane Metrics

      Description

      This initiative provides comprehensive visibility into control plane metrics for users of hosted clusters (like ROSA/ARO) by combining the goals of OCPSTRAT-1659 and OCPSTRAT-1852. Currently, administrators of hosted clusters have a limited view of the control plane, which hinders their ability to monitor both the health of cluster operators and the behavior of their own workloads.

      Problem Statement

      1. Missing Metrics: Critical operator health metrics (e.g., `csv_succeeded`) are generated on the control plane within the management cluster but are not propagated to the hosted cluster's monitoring stack. This leaves users blind to important signals about operator status.
      2. Incomplete Kubernetes Metrics: While some metrics from the `kube-apiserver` are available, they can be inconsistent. Furthermore, metrics from other essential components like the `kube-scheduler` and `kube-controller-manager` are not exposed at all, preventing users from understanding how their workloads impact the control plane.
      3. Edge Cases: Metric propagation needs to be reliable, even for clusters with no worker nodes.
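The gaps described above could be checked with a small script along the following lines. This is a minimal sketch, not part of the proposal: it assumes the hosted cluster's metrics are available as Prometheus text-format output, and apart from `csv_succeeded` (named in this issue) the metric names in `EXPECTED_METRICS` are illustrative picks for each component rather than an agreed list. The function names are hypothetical.

```python
# Expected metrics per component. Only csv_succeeded comes from the issue
# text; the other names are illustrative assumptions for each component.
EXPECTED_METRICS = {
    "operator health": ["csv_succeeded"],
    "kube-apiserver": ["apiserver_request_total"],
    "kube-scheduler": ["scheduler_pending_pods"],
    "kube-controller-manager": ["workqueue_depth"],
}


def parse_metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_by_component(available_names):
    """Map each component to the expected metrics that are absent."""
    report = {}
    for component, metrics in EXPECTED_METRICS.items():
        absent = [m for m in metrics if m not in available_names]
        if absent:
            report[component] = absent
    return report


# Hypothetical sample of what a hosted cluster might expose today.
SAMPLE = """\
# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{verb="GET",code="200"} 1027
"""

if __name__ == "__main__":
    print(missing_by_component(parse_metric_names(SAMPLE)))
```

Running the script against the sample lists every component except `kube-apiserver` as missing its expected metric, which mirrors problems 1 and 2 above.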

      Goals

      • Expose and Propagate Critical Metrics: Implement a robust mechanism to push key metrics from the management cluster's control plane to the hosted cluster's data plane. This includes:
        • Operator health metrics (as defined in OCPSTRAT-1659).
        • Metrics from `kube-apiserver`, `kube-scheduler`, and `kube-controller-manager` to provide insights into workload behavior (as defined in OCPSTRAT-1852).
      • Ensure Data Reliability: Improve the scraping mechanism for metrics like those from `kube-apiserver` to ensure consistency and accuracy.
      • Universal Availability: Ensure that these metrics are available and queryable through the standard hosted cluster monitoring tools and telemetry, regardless of the cluster's worker node configuration.

      By combining these efforts, we will provide hosted cluster users with a more complete and reliable view of their cluster's control plane, enabling better operational monitoring, workload management, and dashboarding.
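One way to make the "Ensure Data Reliability" goal measurable is to look for gaps between consecutive samples of a series. The sketch below is an assumption about how such a check might look, not the mechanism this initiative will implement; the function name, the `tolerance` parameter, and the 30-second interval in the usage note are all hypothetical.

```python
def find_scrape_gaps(timestamps, scrape_interval_s, tolerance=1.5):
    """Return (previous, current) timestamp pairs that are further apart
    than tolerance * scrape_interval_s, i.e. where scrapes were likely missed.

    timestamps: sample times in seconds (any order).
    scrape_interval_s: the configured scrape interval in seconds.
    """
    ordered = sorted(timestamps)
    threshold = scrape_interval_s * tolerance
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > threshold:
            gaps.append((prev, curr))
    return gaps
```

For example, with an assumed 30-second interval, `find_scrape_gaps([0, 30, 60, 150, 180], 30)` reports the single gap between 60 and 150.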

      Proposed Metrics for Enhanced Observability

      | No. | Use Case / Component | Priority |
      |-----|----------------------|----------|
      | 1 | Ability to observe the state of the cluster monitoring operator to identify whether it is Available, Degraded, or Progressing (Cluster Operators: Monitoring, Console) | P0 |
      | 2 | Ability to observe the readiness of the individual controllers of the Prometheus Operator: Prometheus, Alertmanager, Thanos | P0 |
      | 3 | Ability to observe API server metrics, broken down by HTTP verb, client IP, and API resource, for purposes such as tracking usage and defining flow controls among cluster users | P0 |
      | 4 | Ability to observe pod scheduling metrics to plan worker node capacity and adjust labels/taints on nodes or priority classes on pods | P0 |
      | 5 | Ability to track etcd storage utilization and performance to plan for cluster capacity and the etcd limit of 8 GB | P0 |
      | 6 | Ability to observe IAM to track AuthN and AuthZ patterns across users; ability to track the number of AuthN/AuthZ requests and failures on the OpenShift Console | P0 |
      | 7 | Ability to observe node and kubelet metrics to plan node capacity and troubleshoot application storage, memory, and CPU issues using: process ID limits based on pods/containers on the node(s); storage used by containers for logs in the file system; network usage based on image pull latencies; Linux stats; storage operation latencies for application storage volumes and secrets; container-level resource usage | P0 |
      | 8 | Ability to track storage utilization of the PVC (EBS) assigned to the CMO and used for persisting metrics. AWS EBS metrics: total IO (R&W) ops, total IO (R&W) bytes, total IO (R&W) times, IO queue length, EBS volume IO exceeded check, EBS volume throughput exceeded check | P1 |
      | 9 | Ability to track the number of images and image streams in the in-cluster image registry; ability to troubleshoot application start-up and deployment issues by observing storage (S3) operation latencies | P1 |
      | 10 | Ability to manage the life cycle of operators available through OperatorHub: get notified when an operator is introduced to or removed from the marketplace; get notified when the installation, upgrade, or removal of an OLM-managed operator is unsuccessful | P2 |
      | 11 | Ability to observe ingress controllers, HAProxy-based routers, Kubernetes services, and load balancers so that: additional ingress controllers can be created to shard routes; routers that are part of ingress controllers can be sized, scaled, or scheduled on nodes based on the number of sessions, front-end/back-end performance, etc.; AWS quotas for ELBv2 can be managed; apps/routes published using optional ingress controllers can be monitored using Golden Signals | P2 |
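As a worked example for use case 5, the sketch below turns an etcd database size into a utilization ratio against the 8 GB limit. It is illustrative only: it assumes the limit means 8 GiB, that the size comes from a metric such as `etcd_mvcc_db_total_size_in_bytes`, and that 75% and 90% are reasonable warning and critical thresholds. None of these choices is specified in this issue, and the function names are hypothetical.

```python
# Assumption: the "8 GB" etcd limit is interpreted as 8 GiB.
ETCD_LIMIT_BYTES = 8 * 1024**3


def etcd_utilization(db_size_bytes, limit_bytes=ETCD_LIMIT_BYTES):
    """Return the fraction of the etcd storage limit currently in use."""
    if db_size_bytes < 0:
        raise ValueError("db_size_bytes must be non-negative")
    return db_size_bytes / limit_bytes


def capacity_status(db_size_bytes, warn_at=0.75, critical_at=0.90):
    """Classify utilization; the thresholds are hypothetical defaults."""
    ratio = etcd_utilization(db_size_bytes)
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"
```

With these assumptions, a 4 GiB database is at 50% utilization and classified as "ok", while a 7 GiB database (87.5%) is classified as "warning".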

              linnguye.openshift Linh Nguyen
              cewong@redhat.com Cesar Wong
              Matthew Werner
              Senthamilarasu S