- Feature
- Resolution: Unresolved
- Major
- Proactive Architecture
- 100% To Do, 0% In Progress, 0% Done
Feature Overview (aka. Goal Summary)
In HCP environments, certain metrics such as csv_succeeded are produced by control plane pods. These metrics are scraped in the management cluster and sent to RHOBS but are not propagated to the hosted cluster's monitoring stack. As a result, users are unable to see these critical metrics directly in the hosted cluster.
The goal of this feature is to implement a solution for pushing these metrics to the data plane of hosted clusters. This may include pushing them to telemetry, ensuring that they are queryable through existing monitoring tools, and resolving the issue where certain clusters (e.g., with zero workers) may not propagate these metrics.
Goals (aka. expected user outcomes)
- Identify a solution to push csv_succeeded and similar control plane metrics to the hosted cluster's monitoring stack or telemetry.
- Explore how to ensure that these metrics can be queried reliably for ROSA/ARO clusters.
- Define the behavior of the monitoring system when clusters have zero workers.
- Provide recommendations on which control plane metrics should be propagated to the data plane.
Requirements (aka. Acceptance Criteria):
- Telemetry Integration: The system must push control plane-generated metrics, such as csv_succeeded, to telemetry for HCP-backed clusters (e.g., ROSA). These metrics should be queryable using existing telemetry tools.
- Data Propagation: Ensure that the csv_succeeded metric and other relevant control plane metrics are propagated from the management cluster to the hosted cluster’s monitoring stack or are available via telemetry, regardless of cluster size or configuration (including clusters with zero workers).
- Metric Selection: Define which control plane metrics need to be pushed to the data plane, ensuring that all critical metrics (e.g., operator health metrics) are available for dashboards and monitoring purposes.
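To make the Metric Selection requirement concrete, here is a minimal sketch of an allowlist filter applied to scraped output in the Prometheus text exposition format. It is an illustration under stated assumptions, not a proposed implementation: the `ALLOWLIST` set, the `filter_exposition` helper, and every label value in `SAMPLE` are hypothetical, and only csv_succeeded is named in this feature.

```python
import re

# Hypothetical allowlist; only csv_succeeded is named in this feature.
ALLOWLIST = {"csv_succeeded"}

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def filter_exposition(text, allowlist=ALLOWLIST):
    """Return only the sample lines whose metric name is allowlisted."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match and match.group(1) in allowlist:
            kept.append(line)
    return kept


# Illustrative scrape output; the HELP text and label values are made up.
SAMPLE = """\
# HELP csv_succeeded Illustrative help text
# TYPE csv_succeeded gauge
csv_succeeded{name="example-operator.v1.0.0",namespace="example"} 1
csv_abnormal{name="other-operator.v0.1.0",namespace="example"} 1
process_cpu_seconds_total 12.5
"""

if __name__ == "__main__":
    for sample in filter_exposition(SAMPLE):
        print(sample)
```

Keeping the selection as an explicit set of metric names means the list of propagated control plane metrics can be reviewed and extended without touching the filtering logic.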
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Yes |
Classic (standalone cluster) | |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) | |
Use Cases (Optional):
- When monitoring the health of operators in HCP clusters, engineers need to ensure they have access to critical metrics like csv_succeeded, which are currently not available in the hosted cluster’s monitoring stack. They want a way to reliably retrieve these metrics through telemetry so that they can maintain full visibility into the operational health of the clusters.
- When creating dashboards and reports for ACM, the team needs accurate and complete data on cluster versions and operator health, which is currently missing for HCP clusters. They want the relevant metrics to be pushed to telemetry, allowing them to generate comprehensive reports across all clusters, regardless of the environment.
- When managing clusters with zero workers, platform administrators need to ensure that critical control plane metrics are still propagated, even when no workers are present. They want a solution that guarantees key metrics are pushed to telemetry so that they can maintain visibility into cluster health and functionality, regardless of worker availability.
- When querying for metrics in ROSA environments, customers need access to operator health metrics that are typically only available in the management cluster. They want a reliable way to query telemetry for these metrics, allowing them to audit, analyze, and report on the health of their clusters effectively.
Documentation Considerations
Document how to consume the desired metrics.
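As a starting point for that documentation, the sketch below shows one way a consumer might query csv_succeeded, assuming the metrics end up behind a Prometheus-compatible HTTP API (the `/api/v1/query` instant-query endpoint). The base URL, the helper names, the `name` label, and the reading of a value of 0 as "not succeeded" are all assumptions made for illustration; the actual endpoint and label set depend on the solution chosen for this feature.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def failed_csvs(response_body):
    """Return the `name` label of every series whose value is 0.

    Assumes the response body is the JSON returned by an instant query
    and that a value of 0 marks a CSV that has not succeeded.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    names = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]
        if float(value) == 0:
            names.append(series["metric"].get("name", "<unknown>"))
    return names


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the real query frontend.
    print(build_query_url("https://prometheus.example.com", "csv_succeeded"))
```

The URL builder and the response parser are kept separate so the parsing logic can be exercised against a canned response without network access.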