Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Hosted Control Planes
Labels:

Work Type:
Proactive Architecture
Blocked:
False
Blocked Reason:
None
Ready:
False
Parent Link:
OCPSTRAT-1853 Enhanced Visibility into Control Plane and Data Plane Metrics
Hierarchy Progress Bar:

100% To Do, 0% In Progress, 0% Done

Business Value:
4
Risk Score:
0

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

In HCP environments, certain metrics such as csv_succeeded are produced by control plane pods. These metrics are scraped in the management cluster and sent to RHOBS but are not propagated to the hosted cluster's monitoring stack. As a result, users are unable to see these critical metrics directly in the hosted cluster.

The goal of this feature is to implement a solution for pushing these metrics to the data plane of hosted clusters. This may include pushing them to telemetry, ensuring that they are queryable through existing monitoring tools, and resolving the issue where certain clusters (e.g., with zero workers) may not propagate these metrics.

Goals (aka. expected user outcomes)

Identify a solution to push csv_succeeded and similar control plane metrics to the hosted cluster's monitoring stack or telemetry.
Explore how to ensure that these metrics can be queried reliably for ROSA/ARO clusters.
Define the behavior of the monitoring system when clusters have zero workers.
Provide recommendations on which control plane metrics should be propagated to the data plane.

Requirements (aka. Acceptance Criteria):

Telemetry Integration: The system must push control plane-generated metrics, such as csv_succeeded, to telemetry for HCP-backed clusters (e.g., ROSA). These metrics must be queryable using existing telemetry tools.
Data Propagation: The csv_succeeded metric and other relevant control plane metrics must be propagated from the management cluster to the hosted cluster's monitoring stack, or made available via telemetry, regardless of cluster size or configuration (including clusters with zero workers).

Metric Selection: Define which control plane metrics need to be pushed to the data plane, ensuring that all critical metrics (e.g., operator health metrics) are available for dashboards and monitoring purposes.
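
As an illustration of the Metric Selection requirement, here is a minimal sketch of allowlist-based filtering over Prometheus text exposition output. It is not a proposed implementation: the allowlist contents beyond csv_succeeded, the function names, and the sample label values are all assumptions made for the example.

```python
# Hypothetical allowlist; only csv_succeeded is named in this feature request.
ALLOWED_METRICS = frozenset({"csv_succeeded"})


def metric_name(sample_line: str) -> str:
    """Return the metric name of one exposition-format sample line."""
    # The name ends at the first '{' (labels) or the first space (value).
    for i, ch in enumerate(sample_line):
        if ch in "{ ":
            return sample_line[:i]
    return sample_line


def filter_samples(exposition_text: str, allowed=ALLOWED_METRICS) -> list[str]:
    """Keep only sample lines whose metric name is in the allowlist."""
    kept = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        if metric_name(line) in allowed:
            kept.append(line)
    return kept


# Illustrative scrape output; label values are made up for the example.
scraped = (
    "# TYPE csv_succeeded gauge\n"
    'csv_succeeded{name="example-operator.v1",namespace="example"} 1\n'
    "some_other_metric 5\n"
)
print(filter_samples(scraped))
```

Keeping the selection as an explicit allowlist would make the set of propagated metrics reviewable in one place, which fits the goal of recommending which control plane metrics reach the data plane.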

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	Yes
Classic (standalone cluster)
Hosted control planes	Yes
Multi node, Compact (three node), or Single node (SNO), or all
Connected / Restricted Network
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)
Operator compatibility
Backport needed (list applicable versions)
UI need (e.g. OpenShift Console, dynamic plugin, OCM)
Other (please specify)

Use Cases (Optional):

When monitoring the health of operators in HCP clusters, engineers need to ensure they have access to critical metrics like csv_succeeded, which are currently not available in the hosted cluster’s monitoring stack. They want a way to reliably retrieve these metrics through telemetry so that they can maintain full visibility into the operational health of the clusters.

When creating dashboards and reports for ACM, the team needs accurate and complete data on cluster versions and operator health, which is currently missing for HCP clusters. They want the relevant metrics to be pushed to telemetry, allowing them to generate comprehensive reports across all clusters, regardless of the environment.

When managing clusters with zero workers, platform administrators need to ensure that critical control plane metrics are still propagated, even when no workers are present. They want a solution that guarantees key metrics are pushed to telemetry so that they can maintain visibility into cluster health and functionality, regardless of worker availability.

When querying for metrics in ROSA environments, customers need access to operator health metrics that are typically only available in the management cluster. They want a reliable way to query telemetry for these metrics, allowing them to audit, analyze, and report on the health of their clusters effectively.
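
To make the querying use cases concrete, the following sketch builds an instant-query URL for csv_succeeded. It assumes the metrics end up behind an endpoint that speaks the Prometheus HTTP API (`/api/v1/query`). The base URL and helper name are hypothetical, and authentication is omitted.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical endpoint; replace with a real Prometheus-compatible query endpoint.
BASE_URL = "https://prometheus.example.com"

# Select operator CSVs that report as not succeeded (value 0).
url = build_instant_query_url(BASE_URL, "csv_succeeded == 0")
print(url)
```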

Documentation Considerations

Document how to consume the desired metrics.

links to

Slack discussion

Assignee: Unassigned

Reporter: Adel Zaalouk

Doc Contact: Matthew Werner

Votes: 0

Watchers: 4

Created: 2024/09/17 9:23 AM

Updated: 2025/02/27 4:49 PM
