RHWA-387

NodeHealthCheck status.observedNodes does not update dynamically when new nodes match its selector due to a label change


      The NodeHealthCheck (NHC) controller correctly identifies the initial set of nodes matching its spec.selector when the NHC object is first created.

      However, if a node that did not initially match the selector is later updated to match it (e.g., by adding the missing label), the NHC controller fails to detect the change. The node is not counted in .status.observedNodes and is consequently not monitored for health.

      The only way to force the controller to recognize the newly labeled node is to either delete/recreate the NHC object or restart the NHC controller manager pod.

      Steps to Reproduce

      1. Prerequisites: A cluster with the NodeHealthCheck operator running and at least two worker nodes (e.g., node-1 and node-2).
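
      To confirm the prerequisites, check that the operator pods and both worker nodes are up (the namespace is taken from the commands later in this report):

      oc get pods -n openshift-workload-availability
      oc get nodes node-1 node-2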

      Initial Node Labeling:

       Label node-1 with both required labels:

      oc label node node-1 hypershift.openshift.io/nodePool=test
      oc label node node-1 fencing=true

      Label node-2 with only one of the required labels:

      oc label node node-2 hypershift.openshift.io/nodePool=test
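
      As a quick sanity check (not part of the original steps), list the nodes that currently match both labels; at this point only node-1 should appear:

      oc get nodes -l 'hypershift.openshift.io/nodePool=test,fencing=true'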

      2. Create NodeHealthCheck Object:

      Apply a NodeHealthCheck CR whose selector matches on both labels:

        selector:
          matchLabels:
            hypershift.openshift.io/nodePool: test
            fencing: "true" 
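
      The fragment above shows only the selector; a minimal complete manifest along these lines should work. The name, minHealthy, unhealthy condition, and remediation template below are illustrative assumptions, not values from this report, so substitute the remediator configured in your environment:

      oc apply -f - <<'EOF'
      apiVersion: remediation.medik8s.io/v1alpha1
      kind: NodeHealthCheck
      metadata:
        name: nhc-test                       # hypothetical name
      spec:
        selector:
          matchLabels:
            hypershift.openshift.io/nodePool: test
            fencing: "true"
        minHealthy: 51%                      # assumed; matches the documented default
        unhealthyConditions:                 # example condition; values are assumptions
          - type: Ready
            status: "False"
            duration: 300s
        remediationTemplate:                 # assumed Self Node Remediation template; adjust to your remediator
          apiVersion: self-node-remediation.medik8s.io/v1alpha1
          kind: SelfNodeRemediationTemplate
          namespace: openshift-workload-availability
          name: self-node-remediation-automatic-strategy-template
      EOF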

      3. Observe Initial State:

      Check the status of the newly created NHC object:

      oc get nhc <name> -n openshift-workload-availability -o yaml | grep -i observedNodes
        observedNodes: 1
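
      Equivalently, the counter can be read directly with jsonpath:

      oc get nhc <name> -n openshift-workload-availability -o jsonpath='{.status.observedNodes}{"\n"}'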

      4. Trigger the Bug (Update Node 2):

      Add the missing fencing: "true" label to node-2, making it a valid target for the NHC selector:

      oc label node node-2 fencing=true 

      node-2 now fully matches spec.selector.matchLabels.
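
      Re-running the label-selector query from the setup check should now list both nodes:

      oc get nodes -l 'hypershift.openshift.io/nodePool=test,fencing=true'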

      5. Observe Final State:
      Wait several minutes for the controller to reconcile and check the status again:

      oc get nhc <name> -n openshift-workload-availability -o yaml | grep -i observedNodes
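
      To watch for a change over time, a simple polling loop works (replace <name> as above); in the buggy state the value stays at 1:

      while true; do
        oc get nhc <name> -n openshift-workload-availability -o jsonpath='{.status.observedNodes}{"\n"}'
        sleep 30
      done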

      Actual Results

      The status.observedNodes field remains at 1. The controller never detects that node-2 now matches the selector, and the node is not monitored.

      Expected Results

      After node-2 is labeled, the NHC controller's reconciliation loop should detect the change, re-evaluate its selector, and add node-2 to its list of monitored nodes.

      The status.observedNodes field should update to 2.

      Workaround

      Manually forcing the controller to re-initialize its list of nodes "fixes" the issue:

      Workaround 1: Delete and recreate the NHC object (a sketch follows the Workaround 2 command below).
      Workaround 2: Restart the NHC controller manager pod:

      oc delete pods -l app.kubernetes.io/component=controller-manager -n openshift-workload-availability
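
      For Workaround 1, a minimal sketch, assuming the manifest from step 2 was saved as nhc.yaml (the file name is hypothetical):

      oc delete nhc <name> -n openshift-workload-availability
      oc apply -f nhc.yaml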


              Assignee: Unassigned
              Reporter: Vedant Durgam (rhn-support-vdurgam)