Feature Request
Resolution: Unresolved
Priority: Critical
Product / Portfolio Work
This is a follow-up RFE of OCPBUGS-59414.
1. Proposed title of this feature request
Observability for control plane health
2. What is the nature and description of the request?
ClusterOperators in OCP classic provide health metrics and statuses for aggregated components. This is not the case in HCP, where some control plane components are deployed directly by the control plane operator. For some components, the existing ClusterOperators and CVO metrics are currently a red herring: they are mocked by the hosted-cluster-config-operator and do not properly reflect the health of control plane components. Other components have no cluster operator at all, so there are no metrics (or statuses) presenting the health of those cluster operations.
The following components/cluster operations previously existed as cluster operators; their health is now not easy to track, as tracking it would require re-creating custom observability logic:
- openshift-apiserver (cluster operator present, but health mocked)
- openshift-controller-manager (cluster operator present, but health mocked)
- kube-apiserver (cluster operator present, but health mocked)
- kube-controller-manager (cluster operator present, but health mocked)
- kube-scheduler (cluster operator present, but health mocked)
- operator-lifecycle-manager-packageserver (cluster operator present, but health mocked)
- authentication
- cloud-controller-manager
- cloud-credential
- cluster-autoscaler
- etcd
- machine-approver
- marketplace
Additionally, we would like to easily monitor the health of new components added with HCP via metrics and statuses, including but not limited to:
- CAPI (previously present as machine-api cluster operator for classic) / ignition / etc.
- hosted-cluster-config-operator
- control-plane-operator
This RFE requests a replacement for ClusterOperators in HCP, which should provide service providers running the control plane with the health of aggregated components and, where applicable, the cause of any degradation.
For components that run on both the control and data plane (e.g. cloud-controller-manager), the service provider should be able to distinguish the state of the control plane from the state of the data plane. Ideally, it should be clear whether a degradation is caused by a misconfiguration or modification on the HostedCluster user's side (data plane or cloud environment) or on the service provider's side.
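The distinction requested above could be modeled as a per-component condition that records both the plane a component runs on and the suspected origin of a degradation. A minimal sketch in Go; every type, field, and value name here is hypothetical, not an existing HyperShift API:

```go
package main

import "fmt"

// Plane says where the component runs; Origin says whose side the suspected
// cause of a degradation is on. All names are illustrative assumptions.
type Plane string
type Origin string

const (
	ControlPlane Plane = "ControlPlane"
	DataPlane    Plane = "DataPlane"

	ServiceProvider   Origin = "ServiceProvider"
	HostedClusterUser Origin = "HostedClusterUser"
)

// ComponentCondition is a hypothetical status entry that would let an SRE
// tell control plane degradation apart from a user-side data plane change.
type ComponentCondition struct {
	Component string
	Plane     Plane
	Degraded  bool
	Origin    Origin // only meaningful when Degraded is true
	Message   string
}

// Summary renders the condition as a one-line status string.
func (c ComponentCondition) Summary() string {
	if !c.Degraded {
		return fmt.Sprintf("%s [%s]: healthy", c.Component, c.Plane)
	}
	return fmt.Sprintf("%s [%s]: degraded (origin=%s): %s",
		c.Component, c.Plane, c.Origin, c.Message)
}

func main() {
	c := ComponentCondition{
		Component: "cloud-controller-manager",
		Plane:     DataPlane,
		Degraded:  true,
		Origin:    HostedClusterUser,
		Message:   "cloud credentials modified in guest cluster",
	}
	fmt.Println(c.Summary())
}
```

The point of the sketch is only that plane and origin are first-class fields a service provider can alert on, rather than information buried in free-text messages.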
---
Example use case for authentication: currently, SRE observes the health of pods running in the control plane and treats running oauth pods as the source of truth for alerting on cluster authentication.
This has clear gaps:
- What We See Now: We only monitor if the Oauth pods are running. That's our current "source of truth" for cluster authentication working.
- The Flaw: Just because the Oauth pods are up doesn't mean authentication actually works. It only proves the component is available, not that the functionality is healthy.
- The Missing Link: The crucial setup logic - taking the IDP configuration, performing necessary DNS checks, and generating the configmap used by the oauth pods - is handled by the CPO outside of the Oauth pods.
- The Result: If the CPO fails during the configuration pipeline (e.g., a DNS lookup error), authentication will be broken for users, but our "Oauth Pods Running" observability will remain green.
- What We Need: A single, aggregated status (like the classic OCP authentication Cluster Operator) that checks the entire flow / aggregates components: Is the CPO configuring the IDP correctly, AND are the Oauth pods running correctly? We need a health metric that confirms the function is working as expected.
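The aggregation described above can be sketched as a small pure function: the overall authentication status is healthy only if every step of the flow (CPO IDP configuration, DNS checks, running oauth pods) reports healthy, so a CPO configuration failure turns the aggregate red even while the pods are up. A minimal illustration in Go, with all check names hypothetical:

```go
package main

import "fmt"

// ComponentCheck is one hypothetical health signal in the authentication
// flow; the names are illustrative, not an existing HyperShift API.
type ComponentCheck struct {
	Name    string
	Healthy bool
	Reason  string
}

// AggregateAuthStatus reports the authentication flow as healthy only when
// every check passes; the first failing check supplies the degradation cause.
func AggregateAuthStatus(checks []ComponentCheck) (bool, string) {
	for _, c := range checks {
		if !c.Healthy {
			return false, fmt.Sprintf("%s: %s", c.Name, c.Reason)
		}
	}
	return true, "AsExpected"
}

func main() {
	// Oauth pods are running, but the CPO's IDP configuration pipeline failed:
	// pod-based observability alone stays green, the aggregate does not.
	checks := []ComponentCheck{
		{Name: "cpo-idp-config", Healthy: false, Reason: "DNS lookup for IDP host failed"},
		{Name: "oauth-pods-running", Healthy: true, Reason: "AsExpected"},
	}
	healthy, reason := AggregateAuthStatus(checks)
	fmt.Printf("healthy=%v reason=%q\n", healthy, reason)
}
```

This mirrors what the classic authentication ClusterOperator provides: a single condition whose reason pinpoints which step of the flow broke.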
3. Why does the customer need this? (List the business requirements here)
General observability of HCP.
4. List any affected packages or components.
Issue links:
- causes: OCPSTRAT-2563 Feature Parity with ROSA Classic for ROSA-HCP (In Progress)
- is caused by: OCPSTRAT-1611 [Observability] Provide proactive metrics around connectivity between control plane and data plane in HCP clusters (In Progress)
- relates to: OCPBUGS-59414 No CVO metrics for etcd (Closed)
- relates to: OCPSTRAT-1853 Enhanced Visibility into Control Plane and Data Plane Metrics (In Progress)