OpenShift Container Platform (OCP) Strategy / OCPSTRAT-1828

Enhance NodeHealthCheck (NHC) Functionality in Hosted Control Planes to Integrate with Upgrade Signals

      Feature Overview (Goal Summary)

      This feature aims to address the current limitations of NodeHealthCheck (NHC) in Hosted Control Planes (HCP). While NHC is operational in HCP, it lacks the ability to block or pause upgrades when nodes are unhealthy. The goal is to explore solutions that enable NHC to make informed remediation decisions during upgrades by integrating or interacting with ClusterVersionOperator (CVO) and ClusterVersionStatus (CVS) signals.

      Background

      NodeHealthCheck (NHC) runs effectively in HCP environments but currently operates without visibility into ClusterVersionOperator (CVO) or ClusterVersionStatus (CVS) signals. This gap creates a risk during cluster upgrades, as NHC may remediate nodes that are temporarily unhealthy due to upgrade processes, potentially disrupting or failing the upgrade.

      This Jira explores multiple approaches to improve NHC functionality in HCP by providing the necessary visibility or coordination with CVO/CVS, enabling it to assess upgrade readiness and take appropriate actions.

      Goals (Expected User Outcomes)

      • Provide NHC in HCP with the ability to:
        • Access or replicate CVO/CVS upgrade signals.
        • Pause or adapt remediation actions during upgrades to prevent disruption.
        • Resume normal operations post-upgrade.
      • Ensure cluster stability during upgrades by addressing gaps in NHC’s behavior in HCP.
      • Maintain or improve usability of NHC for customers actively relying on its functionality.
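
      The pause-and-resume behavior listed above can be sketched as a small decision function. This is only an illustration under stated assumptions, not NHC's actual logic: the names UpgradeSignal, Decision, and decide_remediation are hypothetical, and the upgrade signal is assumed to have already been reduced to a single boolean derived from CVO/CVS.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """Possible outcomes for one node (hypothetical names)."""
    NONE = "none"            # node is healthy, nothing to do
    PAUSE = "pause"          # node is unhealthy, but an upgrade is running
    REMEDIATE = "remediate"  # node is unhealthy and no upgrade is running


@dataclass(frozen=True)
class UpgradeSignal:
    """Assumed, simplified view of the CVO/CVS upgrade signal."""
    upgrade_in_progress: bool


def decide_remediation(node_unhealthy: bool, signal: UpgradeSignal) -> Decision:
    """Hold back remediation while an upgrade is in progress."""
    if not node_unhealthy:
        return Decision.NONE
    if signal.upgrade_in_progress:
        # The node may be only temporarily unhealthy because of the upgrade.
        return Decision.PAUSE
    # No upgrade in progress: normal remediation applies.
    return Decision.REMEDIATE
```

      Because the function is re-evaluated with the current signal each time, normal remediation resumes as soon as the signal reports that the upgrade has finished; no separate "resume" step is needed in this sketch.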

      Use Cases

      1. Upgrade Safety:
        • Prevent upgrade failures caused by NHC remediating nodes while an upgrade is in progress.
        • Provide visibility into node health and upgrade readiness.
      2. Feature Consistency: Align NHC behavior across Hosted Control Planes and standalone OpenShift environments.
      3. Customer Usage Support: Ensure NHC continues to meet growing customer needs in HCP environments.

      Requirements (Acceptance Criteria)

      1. Enable NHC to factor CVO/CVS signals into its remediation logic during upgrades.
      2. Explore and document the best approach for achieving this integration, considering options such as:
        • Replicating CVO/CVS data into tenant clusters.
        • Enhancing NHC functionality directly in HCP.
        • Exploring alternatives, such as fencing or removing NHC, if integration is not feasible.
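
      For the replication option, one possible shape is to copy upgrade status into the tenant cluster as Kubernetes-style conditions and let NHC read them. The sketch below rests on assumptions that this issue does not confirm: that the replicated data is a list of conditions with "type" and "status" fields, and that a Progressing condition with status "True" means an upgrade is running. The function name and the default parameter are hypothetical.

```python
from typing import Iterable, Mapping


def upgrade_in_progress(
    conditions: Iterable[Mapping[str, str]],
    default: bool = False,
) -> bool:
    """Derive an upgrade signal from replicated status conditions.

    Assumption: each condition is a mapping with "type" and "status" keys,
    and type "Progressing" with status "True" marks a running upgrade.
    """
    for condition in conditions:
        if condition.get("type") == "Progressing":
            return condition.get("status") == "True"
    # No Progressing condition was replicated: fall back to the caller's choice.
    return default
```

      The default parameter makes the missing-signal case an explicit choice: False keeps today's behavior (remediate as usual), while True is the conservative option that holds remediation whenever the signal is unavailable.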

      Out of Scope

      1. NHC implementation changes in standalone OpenShift environments.
      2. Changes to the underlying CVO or its upgrade signaling mechanisms.

      Deployment Considerations

      Category                          Applicability
      Self-managed, managed, or both    Both
      Classic (standalone cluster)      N/A
      Hosted control planes             Applicable
      Multi-node, Compact, or SNO       N/A
      Connected/Restricted Network      Both
      Architectures                     x86_64, ARM, IBM Power, IBM Z
      Operator compatibility            Must integrate with ClusterVersionOperator
      Backport needed
      UI need
      Other

       

              gausingh@redhat.com Gaurav Singh
              azaalouk Adel Zaalouk