Feature
Resolution: Unresolved
Major
BU Product Work
100% To Do, 0% In Progress, 0% Done
Feature Overview (Goal Summary)
This feature aims to address the current limitations of NodeHealthCheck (NHC) in Hosted Control Planes (HCP). While NHC is operational in HCP, it lacks the ability to block or pause upgrades when nodes are unhealthy. The goal is to explore solutions that enable NHC to make informed remediation decisions during upgrades by integrating or interacting with ClusterVersionOperator (CVO) and ClusterVersionStatus (CVS) signals.
Background
NodeHealthCheck (NHC) runs effectively in HCP environments but currently operates without visibility into ClusterVersionOperator (CVO) or ClusterVersionStatus (CVS) signals. This gap creates a risk during cluster upgrades, as NHC may remediate nodes that are temporarily unhealthy due to upgrade processes, potentially disrupting or failing the upgrade.
This Jira explores multiple approaches to improve NHC functionality in HCP by providing the necessary visibility or coordination with CVO/CVS, enabling it to assess upgrade readiness and take appropriate actions.
Goals (Expected User Outcomes)
- Provide NHC in HCP with the ability to:
  - Access or replicate CVO/CVS upgrade signals.
  - Pause or adapt remediation actions during upgrades to prevent disruption.
  - Resume normal operations post-upgrade.
- Ensure cluster stability during upgrades by addressing gaps in NHC’s behavior in HCP.
- Maintain or improve usability of NHC for customers actively relying on its functionality.
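The pause-and-resume behavior described in these goals can be sketched as a small decision function. This is a minimal illustration, not NHC's implementation: it assumes the upgrade signal arrives as a list of ClusterVersion-style status conditions, and that a `Progressing` condition with status `"True"` means an upgrade is under way. The function names `upgrade_in_progress` and `should_remediate` are hypothetical.

```python
from typing import Mapping, Sequence

Condition = Mapping[str, str]


def upgrade_in_progress(conditions: Sequence[Condition]) -> bool:
    """Return True if any condition reports Progressing=True.

    Assumption: conditions mirror the ClusterVersion status layout,
    i.e. mappings with "type" and "status" keys.
    """
    return any(
        c.get("type") == "Progressing" and c.get("status") == "True"
        for c in conditions
    )


def should_remediate(node_unhealthy: bool, conditions: Sequence[Condition]) -> bool:
    """Remediate only unhealthy nodes, and only when no upgrade is running."""
    if not node_unhealthy:
        return False
    return not upgrade_in_progress(conditions)


upgrading = [{"type": "Available", "status": "True"},
             {"type": "Progressing", "status": "True"}]
settled = [{"type": "Available", "status": "True"},
           {"type": "Progressing", "status": "False"}]

print(should_remediate(True, upgrading))  # remediation is paused during the upgrade
print(should_remediate(True, settled))    # normal operation resumes afterwards
```

Keeping the check as a pure function over conditions means the same logic applies whether the signal is read directly or replicated from elsewhere.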
Use Cases
- Upgrade Safety:
  - Prevent upgrade failures caused by overlapping NHC remediation actions.
  - Provide visibility into node health and upgrade readiness.
- Feature Consistency: Align NHC behavior across Hosted Control Planes and standalone OpenShift environments.
- Customer Usage Support: Ensure NHC continues to meet growing customer needs in HCP environments.
Requirements (Acceptance Criteria)
- Enable NHC to factor CVO/CVS signals into its remediation logic during upgrades.
- Explore and document the best approach for achieving this integration, considering options such as:
  - Replicating CVO/CVS data into tenant clusters.
  - Enhancing NHC functionality directly in HCP.
  - Exploring alternatives like fencing or eliminating NHC if integration is not feasible.
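As one way to picture the replication option, the sketch below projects ClusterVersion-style conditions into a small flat record that could be copied into the tenant cluster, and treats a stale copy as "upgrade state unknown" rather than as "no upgrade". Everything here is an assumption for illustration: the `UpgradeSignal` record, the `max_age_seconds` threshold, and the choice to keep remediation paused when the copy is missing or stale are hypothetical and not part of any existing API.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class UpgradeSignal:
    """Hypothetical flat record replicated into the tenant cluster."""
    progressing: bool   # True while an upgrade is under way
    observed_at: float  # Unix time at which the source status was read


def project_signal(conditions: Sequence[Mapping[str, str]], now: float) -> UpgradeSignal:
    """Reduce ClusterVersion-style conditions to the one bit NHC needs."""
    progressing = any(
        c.get("type") == "Progressing" and c.get("status") == "True"
        for c in conditions
    )
    return UpgradeSignal(progressing=progressing, observed_at=now)


def remediation_allowed(signal: Optional[UpgradeSignal], now: float,
                        max_age_seconds: float = 300.0) -> bool:
    """Allow remediation only with a fresh signal that reports no upgrade.

    Assumption: a missing or stale copy is treated as unknown, so
    remediation stays paused rather than risking an upgrade disruption.
    """
    if signal is None or now - signal.observed_at > max_age_seconds:
        return False
    return not signal.progressing


idle = project_signal([{"type": "Progressing", "status": "False"}], now=1000.0)
print(remediation_allowed(idle, now=1100.0))  # fresh copy, no upgrade reported
print(remediation_allowed(idle, now=2000.0))  # copy is older than max_age_seconds
```

The staleness check is the main design question this option raises: a replicated signal can lag behind its source, so the consumer needs a defined behavior for when the copy cannot be trusted.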
Out of Scope
- NHC implementation changes in standalone OpenShift environments.
- Changes to the underlying CVO or its upgrade signaling mechanisms.
Deployment Considerations
Category | Applicability |
---|---|
Self-managed, managed, or both | Both |
Classic (standalone cluster) | N/A |
Hosted control planes | Applicable |
Multi-node, Compact, or SNO | N/A |
Connected/Restricted Network | Both |
Architectures | x86_64, ARM, IBM Power, IBM Z |
Operator compatibility | Must integrate with ClusterVersionOperator |
Backport needed | |
UI need | |
Other | |
- is triggering: OCPBUGS-43096 NodeHealthCheck don't pause during upgrades of hosted cluster (ASSIGNED)
- relates to: OCPSTRAT-1615 Enhanced Debuggability for HyperShift Cluster NodePool Failures (New)