OCPBUGS-43096

NodeHealthCheck doesn't pause remediation during upgrades of a hosted cluster


      Description of problem:

      According to the documentation [1], when NodeHealthCheck (NHC) detects that the cluster is being upgraded, it pauses remediations to avoid interfering with the node updates.
      However, in a Hosted Control Plane (HCP) configuration in which the NodeHealthCheck operator runs inside the hosted cluster and monitors its workers, the upgrade process seems to go unnoticed by NHC, and remediation can affect the workers, as there is no clusterVersionOperator running inside the HCP.
      
      
      [1] https://docs.redhat.com/en/documentation/workload_availability_for_red_hat_openshift/24.3/html-single/remediation_fencing_and_maintenance/index#about-node-health-check-operator_node-health-check-operator
      ~~~
      During the upgrade process, nodes in the cluster might become temporarily unavailable and get identified as unhealthy. In the case of worker nodes, when the Operator detects that the cluster is upgrading, it stops remediating new unhealthy nodes to prevent such nodes from rebooting. 
      ~~~
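
The behaviour quoted above can be pictured as a small gate in front of remediation. The following is a minimal sketch, not the operator's actual code: it assumes the upgrade signal is a `Progressing` condition with status `"True"` on a ClusterVersion-like object, and the names `is_upgrading` and `should_remediate` are hypothetical. If that signal is missing or never set, which is what this report suggests for a hosted cluster, the gate never closes.

```python
from typing import Any, Mapping, Optional


def is_upgrading(cluster_version: Optional[Mapping[str, Any]]) -> bool:
    """Return True if the (assumed) ClusterVersion-like object reports an upgrade."""
    if cluster_version is None:
        # No upgrade signal available: nothing tells the caller to pause.
        return False
    conditions = cluster_version.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Progressing" and c.get("status") == "True"
        for c in conditions
    )


def should_remediate(
    node_unhealthy: bool, cluster_version: Optional[Mapping[str, Any]]
) -> bool:
    """Remediate an unhealthy node only when no upgrade is detected."""
    return node_unhealthy and not is_upgrading(cluster_version)
```

With a progressing object, `should_remediate(True, cv)` is `False`; with no upgrade signal at all, `should_remediate(True, None)` is `True`, which matches the behaviour reported here.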

      Version-Release number of selected component (if applicable):

      Currently tested in 4.15.22 - 4.15.27, but it will probably affect more versions

      How reproducible:

      Always

      Steps to Reproduce:

          1. Install RHACM in the hosting cluster
          2. Deploy a hosted cluster plus two extra worker nodes
          3. Inside the hosted cluster, deploy the whole Workload Availability suite
          4. To test this reliably, reduce the duration in unhealthyConditions to "30s"; with the default of 5 minutes, the node may finish rebooting before the duration elapses:
      
      ~~~
      kind: NodeHealthCheck
      ...
      spec:
      ...
        unhealthyConditions:
        - duration: 30s
          status: "False"
          type: Ready
        - duration: 30s
          status: Unknown
          type: Ready
      ~~~
          5. Start the upgrade of the HCP
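
The reasoning behind step 4 is a simple threshold comparison: a node is only flagged once a matching condition has lasted at least the configured duration. A minimal sketch, where `flagged_unhealthy` is a hypothetical helper and the 120-second outage is an illustrative figure, not a measurement from this report:

```python
def flagged_unhealthy(not_ready_seconds: float, duration_seconds: float) -> bool:
    """A node is flagged once its condition has lasted at least `duration_seconds`."""
    return not_ready_seconds >= duration_seconds


# Illustrative outage length only; the real value depends on the environment.
outage = 120
print(flagged_unhealthy(outage, 300))  # default 5 min threshold
print(flagged_unhealthy(outage, 30))   # reduced 30 s threshold
```

With these assumed numbers only the 30 s threshold flags the node, which is why the shorter duration makes the problem reproducible.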

      Actual results:

      During the upgrade, the node is identified as unhealthy and remediation starts:
      
      ~~~
      conditions:
      - lastTransitionTime: "2024-10-10T15:38:06Z"
        message: No issues found, NodeHealthCheck is enabled.
        reason: NodeHealthCheckEnabled
        status: "False"
        type: Disabled
      healthyNodes: 1
      lastUpdateTime: "2024-10-11T08:12:20Z"
      observedNodes: 2
      phase: Remediating
      reason: NHC is remediating 1 nodes
      unhealthyNodes:
      - name: hosted-worker-1
        remediations:
        - resource:
            apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
            kind: MachineDeletionRemediation
            name: hosted-worker-1
            namespace: openshift-workload-availability
            uid: eab0e40b-5c46-4da7-ac3c-d0bb41aee6e8
          started: "2024-10-11T08:12:20Z"
          templateName: machinedeletionremediationtemplate-sample
      ~~~
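
To check for this state programmatically, the status can be inspected for unhealthy nodes that carry a remediation with a `started` timestamp. A minimal sketch, assuming the NodeHealthCheck status has already been loaded into a Python dict; `nodes_under_remediation` is a hypothetical helper, not part of any operator tooling:

```python
from typing import Any, List, Mapping


def nodes_under_remediation(status: Mapping[str, Any]) -> List[str]:
    """Return names of unhealthy nodes that have at least one started remediation."""
    names = []
    for node in status.get("unhealthyNodes") or []:
        remediations = node.get("remediations") or []
        if any(r.get("started") for r in remediations):
            names.append(node["name"])
    return names


# Trimmed copy of the status shown in this report.
status = {
    "phase": "Remediating",
    "unhealthyNodes": [
        {
            "name": "hosted-worker-1",
            "remediations": [{"started": "2024-10-11T08:12:20Z"}],
        }
    ],
}
print(nodes_under_remediation(status))
```

An empty list during the upgrade would correspond to the expected result; for the status in this report the list contains `hosted-worker-1`.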

      Expected results:

      No remediation is started while the hosted cluster is being upgraded.

      Additional info:

          

       

              agarcial@redhat.com Alberto Garcia Lamela
              rhn-support-mabajodu Mario Abajo Duran
              Zhaohua Sun Zhaohua Sun