Feature Request
Resolution: Unresolved
Priority: Critical
Product / Portfolio Work
This is a follow-up RFE of OCPBUGS-59414.
1. Proposed title of this feature request
Observability for control plane health
2. What is the nature and description of the request?
ClusterOperators in OCP classic provide health metrics and statuses for aggregated components. This is not the case in HCP, where some control plane components are deployed directly by the control plane operator. For some components, the existing ClusterOperators and CVO metrics are currently a red herring: they are mocked by the hosted-cluster-config-operator and do not properly reflect the health of control plane components. Other components have no cluster operator at all, so there are no metrics (or statuses) presenting the health of those cluster operations.
The following components/cluster operations previously existed as cluster operators; their health is now not easy to track, as tracking it would require re-creating custom observability logic:
- openshift-apiserver (cluster operator present, but health mocked)
- openshift-controller-manager (cluster operator present, but health mocked)
- kube-apiserver (cluster operator present, but health mocked)
- kube-controller-manager (cluster operator present, but health mocked)
- kube-scheduler (cluster operator present, but health mocked)
- operator-lifecycle-manager-packageserver (cluster operator present, but health mocked)
- authentication
- cloud-controller-manager
- cloud-credential
- cluster-autoscaler
- etcd
- machine-approver
- marketplace
Additionally, we would like to easily monitor the health of new components added with HCP via metrics and statuses, including but not limited to:
- CAPI (previously present as machine-api cluster operator for classic) / ignition / etc.
- hosted-cluster-config-operator
- control-plane-operator
This RFE requests a replacement for ClusterOperators in HCP, which should provide service providers running the control plane with the health of aggregated components and, where applicable, the cause of any degradation.
For components that run on both the control and data plane (e.g. cloud-controller-manager), the service provider should be able to distinguish the state of the control plane from the state of the data plane. Ideally, it should be clear whether a degradation is caused by a misconfiguration or modification on the HostedCluster user's side (data plane or cloud environment) or on the service provider's side.
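The distinction requested above could be modeled as a per-component condition that records both the plane a component runs on and the suspected origin of a degradation. A minimal sketch in Go; every type, field, and value name here is hypothetical, not an existing HyperShift API:

```go
package main

import "fmt"

// Plane says where the component runs; Origin says whose side the suspected
// cause of a degradation is on. All names are illustrative assumptions.
type Plane string
type Origin string

const (
	ControlPlane Plane = "ControlPlane"
	DataPlane    Plane = "DataPlane"

	ServiceProvider   Origin = "ServiceProvider"
	HostedClusterUser Origin = "HostedClusterUser"
)

// ComponentCondition is a hypothetical status entry that would let an SRE
// tell control plane degradation apart from a user-side data plane change.
type ComponentCondition struct {
	Component string
	Plane     Plane
	Degraded  bool
	Origin    Origin // only meaningful when Degraded is true
	Message   string
}

// Summary renders the condition as a one-line status string.
func (c ComponentCondition) Summary() string {
	if !c.Degraded {
		return fmt.Sprintf("%s [%s]: healthy", c.Component, c.Plane)
	}
	return fmt.Sprintf("%s [%s]: degraded (origin=%s): %s",
		c.Component, c.Plane, c.Origin, c.Message)
}

func main() {
	c := ComponentCondition{
		Component: "cloud-controller-manager",
		Plane:     DataPlane,
		Degraded:  true,
		Origin:    HostedClusterUser,
		Message:   "cloud credentials modified in guest cluster",
	}
	fmt.Println(c.Summary())
}
```

The point of the sketch is only that plane and origin are first-class fields a service provider can alert on, rather than information buried in free-text messages.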
---
Example use case for authentication: currently, SRE observes the health of pods running in the control plane and treats running oauth pods as the source of truth for alerting on cluster authentication.
This has clear gaps:
- What We See Now: We only monitor if the Oauth pods are running. That's our current "source of truth" for cluster authentication working.
- The Flaw: Just because the Oauth pods are up doesn't mean authentication actually works. It only proves the component is available, not that the functionality is healthy.
- The Missing Link: The crucial setup logic - taking the IDP configuration, performing necessary DNS checks, and generating the configmap used by the oauth pods - is handled by the CPO outside of the Oauth pods.
- The Result: If the CPO fails during the configuration pipeline (e.g., a DNS lookup error), authentication will be broken for users, but our "Oauth Pods Running" observability will remain green.
- What We Need: A single, aggregated status (like the classic OCP authentication Cluster Operator) that checks the entire flow / aggregates components: Is the CPO configuring the IDP correctly, AND are the Oauth pods running correctly? We need a health metric that confirms the function is working as expected.
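The aggregation described above can be sketched as a small pure function: the overall authentication status is healthy only if every step of the flow (CPO IDP configuration, DNS checks, running oauth pods) reports healthy, so a CPO configuration failure turns the aggregate red even while the pods are up. A minimal illustration in Go, with all check names hypothetical:

```go
package main

import "fmt"

// ComponentCheck is one hypothetical health signal in the authentication
// flow; the names are illustrative, not an existing HyperShift API.
type ComponentCheck struct {
	Name    string
	Healthy bool
	Reason  string
}

// AggregateAuthStatus reports the authentication flow as healthy only when
// every check passes; the first failing check supplies the degradation cause.
func AggregateAuthStatus(checks []ComponentCheck) (bool, string) {
	for _, c := range checks {
		if !c.Healthy {
			return false, fmt.Sprintf("%s: %s", c.Name, c.Reason)
		}
	}
	return true, "AsExpected"
}

func main() {
	// Oauth pods are running, but the CPO's IDP configuration pipeline failed:
	// pod-based observability alone stays green, the aggregate does not.
	checks := []ComponentCheck{
		{Name: "cpo-idp-config", Healthy: false, Reason: "DNS lookup for IDP host failed"},
		{Name: "oauth-pods-running", Healthy: true, Reason: "AsExpected"},
	}
	healthy, reason := AggregateAuthStatus(checks)
	fmt.Printf("healthy=%v reason=%q\n", healthy, reason)
}
```

This mirrors what the classic authentication ClusterOperator provides: a single condition whose reason pinpoints which step of the flow broke.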
3. Why does the customer need this? (List the business requirements here)
General observability of HCP.
4. List any affected packages or components.
Issue links:
- causes: OCPSTRAT-2563 Feature Parity with ROSA Classic for ROSA-HCP (In Progress)
- is caused by: OCPSTRAT-1611 [Observability] Provide proactive metrics around connectivity between control plane and data plane in HCP clusters (In Progress)
- relates to: OCPBUGS-59414 No CVO metrics for etcd (Closed)
- relates to: OCPSTRAT-1853 Enhanced Visibility into Control Plane and Data Plane Metrics (In Progress)