Feature
Resolution: Duplicate
Major
Future Sustainability
Title: Enhanced Control Plane Metric Observability for Hosted Clusters
Description
This initiative provides comprehensive visibility into control plane metrics for users of hosted clusters (like ROSA/ARO) by combining the goals of OCPSTRAT-1659 and OCPSTRAT-1852. Currently, administrators of hosted clusters have a limited view of the control plane, which hinders their ability to monitor both the health of cluster operators and the behavior of their own workloads.
Problem Statement
- Missing Metrics: Critical operator health metrics (e.g., `csv_succeeded`) are generated on the control plane within the management cluster but are not propagated to the hosted cluster's monitoring stack. This leaves users blind to important signals about operator status.
- Incomplete Kubernetes Metrics: While some metrics from the `kube-apiserver` are available, they can be inconsistent. Furthermore, metrics from other essential components like the `kube-scheduler` and `kube-controller-manager` are not exposed at all, preventing users from understanding how their workloads impact the control plane.
- Edge Cases: Metric propagation needs to be reliable, even for clusters with no worker nodes.
Goals
- Expose and Propagate Critical Metrics: Implement a robust mechanism to push key metrics from the management cluster's control plane to the hosted cluster's data plane. This includes:
  - Operator health metrics (as defined in OCPSTRAT-1659).
  - Metrics from `kube-apiserver`, `kube-scheduler`, and `kube-controller-manager` to provide insights into workload behavior (as defined in OCPSTRAT-1852).
- Ensure Data Reliability: Improve the scraping mechanism for metrics like those from `kube-apiserver` to ensure consistency and accuracy.
- Universal Availability: Ensure that these metrics are available and queryable through the standard hosted cluster monitoring tools and telemetry, regardless of the cluster's worker node configuration.
By combining these efforts, we will provide hosted cluster users with a more complete and reliable view of their cluster's control plane, enabling better operational monitoring, workload management, and dashboarding.
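One way to check the availability goal above is to confirm that the expected metric families actually appear in what the hosted cluster's monitoring stack exposes. The following is a minimal sketch, not part of this feature's design: it assumes metrics are available as Prometheus text exposition output, and apart from `csv_succeeded` the metric names in `EXPECTED_METRICS` are illustrative examples chosen for this sketch rather than a list defined by OCPSTRAT-1659 or OCPSTRAT-1852. The helper names `metric_names` and `missing_metrics` are hypothetical.

```python
import re

# Illustrative set of metric families to look for. Only csv_succeeded is named
# in this feature; the other three are assumed examples, one per component.
EXPECTED_METRICS = {
    "csv_succeeded",            # operator health
    "apiserver_request_total",  # kube-apiserver (assumed example)
    "scheduler_pending_pods",   # kube-scheduler (assumed example)
    "workqueue_depth",          # kube-controller-manager (assumed example)
}

# A sample line starts with the metric name, followed by "{" (labels) or whitespace.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{|\s)")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected metric names that are absent, in sorted order."""
    return sorted(set(expected) - metric_names(exposition_text))


# Made-up sample data for illustration only; label values are hypothetical.
sample = """\
# TYPE csv_succeeded gauge
csv_succeeded{name="example-operator.v1.0.0"} 1
# TYPE apiserver_request_total counter
apiserver_request_total{verb="GET",code="200"} 1024
"""

print(missing_metrics(sample))
```

With the sample above, the scheduler and controller-manager metrics are reported as missing, which mirrors the gap described in the problem statement. The sketch matches sample names literally, so histogram families would appear under their `_bucket`, `_sum`, and `_count` series names.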