[OCPBUGS-43096] NodeHealthCheck don't pause during upgrades of hosted cluster - Red Hat Issue Tracker

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.15.z
Component/s: HyperShift
Labels:

Regression:
None
Blocked:
False
Blocked Reason:

None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
Done
Target Version:

4.17.z
Target Backport Versions:

4.15

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Description of problem:

According to the documentation [1], when NodeHealthCheck (NHC) detects that the cluster is being upgraded, it pauses remediations to avoid interfering with node updates.
However, in a Hosted Control Plane (HCP) configuration where the NodeHealthCheck operator runs inside the hosted cluster and monitors its workers, the upgrade appears to go unnoticed by NHC, so remediation can hit the workers mid-upgrade, as there is no clusterVersionOperator running inside the HCP.


[1] https://docs.redhat.com/en/documentation/workload_availability_for_red_hat_openshift/24.3/html-single/remediation_fencing_and_maintenance/index#about-node-health-check-operator_node-health-check-operator
~~~
During the upgrade process, nodes in the cluster might become temporarily unavailable and get identified as unhealthy. In the case of worker nodes, when the Operator detects that the cluster is upgrading, it stops remediating new unhealthy nodes to prevent such nodes from rebooting. 
~~~
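
The documented behaviour and the gap described above can be modelled roughly as follows. This is an illustrative Python sketch, not the operator's code: the function names are hypothetical, and it assumes the upgrade signal is the `Progressing` condition on a ClusterVersion-like object, which may not match exactly how NHC detects upgrades. The missing signal in the hosted cluster is modelled here as `None`.

```python
from typing import Optional


def is_cluster_upgrading(cluster_version: Optional[dict]) -> bool:
    """Return True if a ClusterVersion-like object reports Progressing=True.

    Assumption: the upgrade signal is the "Progressing" condition. When no
    usable object is available, no upgrade is detected.
    """
    if cluster_version is None:
        return False
    conditions = cluster_version.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Progressing" and c.get("status") == "True"
        for c in conditions
    )


def should_start_remediation(node_unhealthy: bool,
                             cluster_version: Optional[dict]) -> bool:
    """New remediations start only for unhealthy nodes outside an upgrade."""
    return node_unhealthy and not is_cluster_upgrading(cluster_version)


# Standalone cluster mid-upgrade: remediation is held back.
upgrading = {"status": {"conditions": [{"type": "Progressing", "status": "True"}]}}
print(should_start_remediation(True, upgrading))  # False

# Hosted cluster with no usable upgrade signal: remediation starts anyway.
print(should_start_remediation(True, None))  # True
```

The second call corresponds to the behaviour reported in this bug: the node is unhealthy because it is being updated, but nothing tells the check that an upgrade is in progress.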

Version-Release number of selected component (if applicable):

Tested so far on 4.15.22 - 4.15.27, but other versions are probably affected as well

How reproducible:

Always

Steps to Reproduce:

    1. Install RHACM on the hosting cluster
    2. Deploy a hosted cluster plus two extra worker nodes
    3. Inside the hosted cluster, deploy the full Workload Availability suite
    4. To test this reliably, reduce the duration of each unhealthyConditions entry to "30s"; with the default of 5 minutes, the node may finish rebooting before the duration elapses and never be flagged as unhealthy:

~~~
kind: NodeHealthCheck
...
spec:
...
  unhealthyConditions:
  - duration: 30s
    status: "False"
    type: Ready
  - duration: 30s
    status: Unknown
    type: Ready
~~~
    5. Start the upgrade of the HCP
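
Why the shorter duration in step 4 matters can be illustrated with a small sketch. This is a simplified model rather than the operator's matching logic: the helper names, the 120-second reboot window, and writing the 5-minute default as "300s" are assumptions made for illustration.

```python
def parse_duration_seconds(value: str) -> int:
    """Parse a simple duration such as "30s" or "5m" into seconds.

    Simplification: a single integer followed by s, m, or h.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


def matches_unhealthy_condition(cond_type, cond_status, seconds_in_state,
                                unhealthy_conditions):
    """Return True if the node condition has lasted at least as long as a
    matching unhealthyConditions entry requires."""
    for uc in unhealthy_conditions:
        if (uc["type"] == cond_type
                and uc["status"] == cond_status
                and seconds_in_state >= parse_duration_seconds(uc["duration"])):
            return True
    return False


def ready_conditions(duration):
    """Build the two Ready entries used in step 4 with the given duration."""
    return [
        {"type": "Ready", "status": "False", "duration": duration},
        {"type": "Ready", "status": "Unknown", "duration": duration},
    ]


# Hypothetical node that reports Ready=Unknown for 120 seconds while rebooting.
print(matches_unhealthy_condition("Ready", "Unknown", 120, ready_conditions("30s")))
print(matches_unhealthy_condition("Ready", "Unknown", 120, ready_conditions("300s")))
```

With the 30s setting the node counts as unhealthy during the reboot, so the missing pause becomes visible; with the 5-minute default the same reboot would pass without being flagged.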

Actual results:

During the upgrade, the node is identified as unhealthy and remediation is started:

  conditions:
  - lastTransitionTime: "2024-10-10T15:38:06Z"
    message: No issues found, NodeHealthCheck is enabled.
    reason: NodeHealthCheckEnabled
    status: "False"
    type: Disabled
  healthyNodes: 1
  lastUpdateTime: "2024-10-11T08:12:20Z"
  observedNodes: 2
  phase: Remediating
  reason: NHC is remediating 1 nodes
  unhealthyNodes:
  - name: hosted-worker-1
    remediations:
    - resource:
        apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
        kind: MachineDeletionRemediation
        name: hosted-worker-1
        namespace: openshift-workload-availability
        uid: eab0e40b-5c46-4da7-ac3c-d0bb41aee6e8
      started: "2024-10-11T08:12:20Z"
      templateName: machinedeletionremediationtemplate-sample
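
To check for this condition while an upgrade is running, the status above can be scanned for started remediations. A minimal sketch, assuming the NodeHealthCheck status has already been loaded into a Python dict (for example from the YAML output); the function name is hypothetical, and only fields shown in the status above are used.

```python
def started_remediations(nhc_status: dict) -> list:
    """Return (node name, remediation kind, start time) for every
    remediation listed under unhealthyNodes."""
    found = []
    for node in nhc_status.get("unhealthyNodes", []):
        for rem in node.get("remediations", []):
            found.append((node["name"], rem["resource"]["kind"], rem["started"]))
    return found


# Trimmed copy of the status reported above.
status = {
    "phase": "Remediating",
    "healthyNodes": 1,
    "observedNodes": 2,
    "unhealthyNodes": [
        {
            "name": "hosted-worker-1",
            "remediations": [
                {
                    "resource": {
                        "kind": "MachineDeletionRemediation",
                        "name": "hosted-worker-1",
                    },
                    "started": "2024-10-11T08:12:20Z",
                }
            ],
        }
    ],
}

for name, kind, started in started_remediations(status):
    print(f"{name}: {kind} started at {started}")
```

An empty result during the upgrade would correspond to the expected behaviour described below.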

Expected results:

No remediation started

Additional info:

duplicates

RHWA-11 NHC: Support hcp upgrade

Review

is triggered by

OCPSTRAT-1828 Enhance NodeHealthCheck (NHC) Functionality in Hosted Control Planes to Integrate with Upgrade Signals

links to

RHBA-2025:4012 OpenShift Container Platform 4.17.z bug fix update

Assignee: Alberto Garcia Lamela

Reporter: Mario Abajo Duran

QA Contact: He Liu

Votes: 1

Watchers: 11

Created: 2024/10/11 1:55 PM

Updated: 2025/04/23 12:41 PM

Resolved: 2025/04/21 1:28 PM
