Feature
Resolution: Unresolved
Major
Product / Portfolio Work
Description
This initiative provides comprehensive visibility into control plane metrics for users of hosted clusters (such as ROSA and ARO) by combining the goals of OCPSTRAT-1659 and OCPSTRAT-1852. Currently, administrators of hosted clusters have only a limited view of the control plane, which hinders their ability to monitor both the health of cluster operators and the behavior of their own workloads.
Problem Statement
- Missing Metrics: Critical operator health metrics (e.g., `csv_succeeded`) are generated on the control plane within the management cluster but are not propagated to the hosted cluster's monitoring stack. This leaves users blind to important signals about operator status.
- Incomplete Kubernetes Metrics: While some metrics from the `kube-apiserver` are available, they can be inconsistent. Furthermore, metrics from other essential components like the `kube-scheduler` and `kube-controller-manager` are not exposed at all, preventing users from understanding how their workloads impact the control plane.
- Edge Cases: Metric propagation needs to be reliable, even for clusters with no worker nodes.
Goals
- Expose and Propagate Critical Metrics: Implement a robust mechanism to push key metrics from the management cluster's control plane to the hosted cluster's data plane. This includes:
  - Operator health metrics (as defined in OCPSTRAT-1659).
  - Metrics from `kube-apiserver`, `kube-scheduler`, and `kube-controller-manager` to provide insights into workload behavior (as defined in OCPSTRAT-1852).
- Ensure Data Reliability: Improve the scraping mechanism for metrics like those from `kube-apiserver` to ensure consistency and accuracy.
- Universal Availability: Ensure that these metrics are available and queryable through the standard hosted cluster monitoring tools and telemetry, regardless of the cluster's worker node configuration.
By combining these efforts, we will provide hosted cluster users with a more complete and reliable view of their cluster's control plane, enabling better operational monitoring, workload management, and dashboarding.
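
As a rough illustration of the "expose and propagate" goal, the sketch below filters Prometheus text-exposition output down to an allowlist of metric names before it would be forwarded to the hosted cluster. It is not the mechanism this initiative will ship: the `ALLOWLIST` contents (beyond `csv_succeeded`, which is named above), the `filter_exposition` helper, and the sample label values are all hypothetical.

```python
import re

# Hypothetical allowlist. Only csv_succeeded is named in this issue;
# apiserver_request_total is included purely for illustration.
ALLOWLIST = {"csv_succeeded", "apiserver_request_total"}

# A metric name is the leading identifier of a sample line.
_NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def filter_exposition(text, allowlist=ALLOWLIST):
    """Return the sample lines whose metric name is in the allowlist."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME_RE.match(line)
        if match and match.group(1) in allowlist:
            kept.append(line)
    return kept


# Made-up scrape output from a control plane component.
SAMPLE = """\
# TYPE csv_succeeded gauge
csv_succeeded{name="example-operator.v1.0.0",namespace="openshift-operators"} 1
go_goroutines 42
apiserver_request_total{verb="GET",resource="pods",code="200"} 1200
"""

if __name__ == "__main__":
    for sample_line in filter_exposition(SAMPLE):
        print(sample_line)
```

An explicit allowlist keeps the set of propagated series small and reviewable, which matters when control plane metrics from many hosted clusters share one management cluster.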
Proposed Metrics for Enhanced Observability
No. | Use Case / Component | Priority |
---|---|---|
1 | Ability to observe the state of the cluster monitoring operator to identify whether it is Available, Degraded, or Progressing. Cluster Operators: Monitoring, Console | P0 |
2 | Ability to observe readiness of individual controllers of prometheus operator: Prometheus, Alertmanager, Thanos | P0 |
3 | Ability to observe API server metrics to track usage and to define and apply flow controls among cluster users, broken down by HTTP verb, client IP, and API resource | P0 |
4 | Ability to observe pod scheduling metrics to plan worker node capacity and adjust labels/taints on nodes or priority classes on pods | P0 |
5 | Ability to track etcd storage utilization and performance to plan for cluster capacity and etcd limit of 8 GB. | P0 |
6 | Ability to observe IAM to track AuthN and AuthZ patterns across users. Ability to track number of authN/AuthZ requests/failures on OpenShift Console | P0 |
7 | Ability to observe node and kubelet metrics to plan node capacity and troubleshoot application storage, memory, and CPU issues using: process ID limits based on pods/containers on node(s), storage used by containers for logs in the file system, network usage based on image pull latencies, Linux stats, storage operation latencies for application storage volumes and secrets, and container-level resource usage | P0 |
8 | Ability to track storage utilization of the PVC (EBS) assigned to the CMO and used for persisting metrics. AWS EBS metrics: total IO (R&W) ops, total IO (R&W) bytes, total IO (R&W) times, IO queue length, EBS volume IO exceeded check, EBS volume throughput exceeded check | P1 |
9 | Ability to track number of images and image streams in the in-cluster image registry. Ability to troubleshoot application start-up and deployment issues by observing the storage (S3) operation latencies | P1 |
10 | Ability to manage the lifecycle of operators available through OperatorHub: get notified when an operator is introduced to or removed from the marketplace, and get notified when the installation, upgrade, or removal of an OLM-managed operator is unsuccessful | P2 |
11 | Ability to observe ingress controllers, HAProxy-based routers, Kubernetes services, and load balancers so that: additional ingress controllers can be created to shard routes; routers that are part of ingress controllers can be sized, scaled, or scheduled on nodes based on number of sessions, front-end/back-end performance, etc.; AWS quotas for ELBv2 can be managed; and apps/routes published through optional ingress controllers can be monitored using Golden Signals | P2 |
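
To make use case 5 concrete, the following sketch shows the capacity arithmetic a dashboard or alert might perform on an etcd database size sample (for example, a value such as etcd's `etcd_mvcc_db_total_size_in_bytes`, assuming that metric is among those exposed). The helper names and the warning/critical thresholds are hypothetical, and the 8 GB limit is read here as 8 GiB, which is an assumption.

```python
# Assumption: the "8 GB" etcd limit from the table is treated as 8 GiB.
ETCD_QUOTA_BYTES = 8 * 1024**3


def etcd_utilization(db_size_bytes, quota_bytes=ETCD_QUOTA_BYTES):
    """Return the fraction of the etcd quota currently in use."""
    if db_size_bytes < 0 or quota_bytes <= 0:
        raise ValueError("sizes must be non-negative and quota positive")
    return db_size_bytes / quota_bytes


def capacity_status(db_size_bytes, warn_at=0.75, crit_at=0.90):
    """Classify utilization; the thresholds are illustrative only."""
    ratio = etcd_utilization(db_size_bytes)
    if ratio >= crit_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for size_gib in (2, 7, 8):
        size = size_gib * 1024**3
        print(size_gib, "GiB:", etcd_utilization(size), capacity_status(size))
```
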
- relates to: RFE-7673 Enable Hosted Cluster users to monitor CMO stack (Approved)