-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.15.z
Description of problem:
According to the documentation [1], when NodeHealthCheck (NHC) detects that the cluster is being upgraded, it pauses remediations to avoid interfering with node updates. However, in a Hosted Control Plane (HCP) configuration where the NodeHealthCheck operator runs inside the hosted cluster and monitors its workers, the upgrade appears to go unnoticed by NHC. As a result, workers can be remediated during the upgrade, because there is no clusterVersionOperator running inside the HCP.

[1] https://docs.redhat.com/en/documentation/workload_availability_for_red_hat_openshift/24.3/html-single/remediation_fencing_and_maintenance/index#about-node-health-check-operator_node-health-check-operator
~~~
During the upgrade process, nodes in the cluster might become temporarily unavailable and get identified as unhealthy. In the case of worker nodes, when the Operator detects that the cluster is upgrading, it stops remediating new unhealthy nodes to prevent such nodes from rebooting.
~~~
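The behaviour the documentation describes can be sketched as a simple gate. This is only an illustration, not the operator's actual code: the function names are hypothetical, and reading the upgrade state from a `Progressing` condition on a ClusterVersion-style object is an assumption made for the example. It shows why a missing upgrade signal leaves the gate open, which matches what is observed in the HCP setup.

```python
from typing import Any, Mapping, Optional


def is_cluster_upgrading(cluster_version: Mapping[str, Any]) -> bool:
    """Return True if the object reports a Progressing=True condition (assumed signal)."""
    conditions = cluster_version.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Progressing" and c.get("status") == "True"
        for c in conditions
    )


def should_remediate(
    node_unhealthy: bool, cluster_version: Optional[Mapping[str, Any]]
) -> bool:
    """Hypothetical gate: remediate unhealthy nodes unless an upgrade is detected."""
    if not node_unhealthy:
        return False
    if cluster_version is None:
        # No upgrade signal is available, so the pause can never engage.
        return True
    return not is_cluster_upgrading(cluster_version)


upgrading = {"status": {"conditions": [{"type": "Progressing", "status": "True"}]}}
idle = {"status": {"conditions": [{"type": "Progressing", "status": "False"}]}}
```

With these sample objects, `should_remediate(True, upgrading)` is `False` (remediation paused), while `should_remediate(True, None)` is `True`, which corresponds to the case reported here.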
Version-Release number of selected component (if applicable):
Tested on 4.15.22 through 4.15.27; other versions are probably affected as well.
How reproducible:
Always
Steps to Reproduce:
1. Install RHACM in the hosting cluster.
2. Deploy a Hosted Cluster plus two extra worker nodes.
3. Inside the Hosted Cluster, deploy the full Workload Availability suite.
4. To test this reliably, reduce the duration of each entry in unhealthyConditions to "30s". With the default of 5 min, the node can finish rebooting within that period, so no remediation is triggered:
~~~
kind: NodeHealthCheck
...
spec:
  ...
  unhealthyConditions:
  - duration: 30s
    status: "False"
    type: Ready
  - duration: 30s
    status: Unknown
    type: Ready
~~~
5. Start the upgrade of the HCP.
Actual results:
During the upgrade, the node is identified as inactive and remediation is started:

    conditions:
    - lastTransitionTime: "2024-10-10T15:38:06Z"
      message: No issues found, NodeHealthCheck is enabled.
      reason: NodeHealthCheckEnabled
      status: "False"
      type: Disabled
    healthyNodes: 1
    lastUpdateTime: "2024-10-11T08:12:20Z"
    observedNodes: 2
    phase: Remediating
    reason: NHC is remediating 1 nodes
    unhealthyNodes:
    - name: hosted-worker-1
      remediations:
      - resource:
          apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
          kind: MachineDeletionRemediation
          name: hosted-worker-1
          namespace: openshift-workload-availability
          uid: eab0e40b-5c46-4da7-ac3c-d0bb41aee6e8
        started: "2024-10-11T08:12:20Z"
        templateName: machinedeletionremediationtemplate-sample
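To confirm the problem while reproducing it, the status above can be checked programmatically. The following is a minimal sketch that assumes the NodeHealthCheck status has already been loaded into a Python dict (for example from the YAML or JSON output of the resource). The helper name `inflight_remediations` is hypothetical, and the field names are taken only from the status shown in this report.

```python
from typing import Any, List, Mapping, Tuple


def inflight_remediations(status: Mapping[str, Any]) -> List[Tuple[str, str, str]]:
    """List (node name, remediation kind, start time) for each remediation in the status."""
    found = []
    for node in status.get("unhealthyNodes", []):
        for remediation in node.get("remediations", []):
            resource = remediation.get("resource", {})
            found.append(
                (node.get("name"), resource.get("kind"), remediation.get("started"))
            )
    return found


# Trimmed copy of the status reported above.
sample_status = {
    "phase": "Remediating",
    "unhealthyNodes": [
        {
            "name": "hosted-worker-1",
            "remediations": [
                {
                    "resource": {
                        "kind": "MachineDeletionRemediation",
                        "name": "hosted-worker-1",
                    },
                    "started": "2024-10-11T08:12:20Z",
                }
            ],
        }
    ],
}

remediations = inflight_remediations(sample_status)
```

A non-empty result while the HCP upgrade is running indicates that remediation was not paused.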
Expected results:
No remediation started
Additional info: