OpenShift Bugs · OCPBUGS-77171

HyperShift NodePool scale-down does not prioritize NotReady nodes for deletion

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.19
    • Component: HyperShift

      Description of problem:

          When scaling down a HyperShift NodePool (KubeVirt provider) by reducing the replica count, the controller does not consider Node health status (e.g. NotReady) when selecting which Machine to delete. A healthy (Ready) node is removed instead of the unhealthy (NotReady) node, leaving the degraded node running in the cluster.
      
      
      The upstream CAPI MachineSet controllers implement a deletion priority that favors removing unhealthy machines before healthy ones. HyperShift's NodePool controller does not appear to implement equivalent health-aware deletion logic.
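The upstream behavior can be sketched roughly as follows (a minimal, self-contained Go illustration, not actual HyperShift or CAPI code; `machineInfo`, `deletePriority`, and `pickVictims` are hypothetical names chosen for this sketch):

```go
package main

import (
	"fmt"
	"sort"
)

// machineInfo is a hypothetical, simplified stand-in for a CAPI Machine
// plus the Ready condition of its backing Node.
type machineInfo struct {
	name      string
	nodeReady bool
	ageHours  int
}

// deletePriority mirrors the idea behind upstream CAPI's delete-policy
// functions: machines backing unhealthy nodes rank higher and go first.
func deletePriority(m machineInfo) int {
	if !m.nodeReady {
		return 100 // NotReady nodes are preferred scale-down victims
	}
	return 0 // healthy nodes are only removed once no unhealthy ones remain
}

// pickVictims sorts machines by descending delete priority, breaking ties
// by preferring the newest machine, and returns the first n candidates.
func pickVictims(machines []machineInfo, n int) []machineInfo {
	sorted := append([]machineInfo(nil), machines...)
	sort.SliceStable(sorted, func(i, j int) bool {
		pi, pj := deletePriority(sorted[i]), deletePriority(sorted[j])
		if pi != pj {
			return pi > pj
		}
		return sorted[i].ageHours < sorted[j].ageHours // newer first
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	machines := []machineInfo{
		{"kubevirt-test-n8nwn-9s74h", true, 1},
		{"kubevirt-test-n8nwn-njmk8", true, 1},
		{"kubevirt-test-n8nwn-nvpm9", false, 166}, // kubelet stopped, NotReady
		{"kubevirt-test-n8nwn-r2gnm", true, 624},
	}
	// Scaling 4 -> 3 should remove the NotReady machine first.
	fmt.Println(pickVictims(machines, 1)[0].name) // kubevirt-test-n8nwn-nvpm9
}
```

Run against the four nodes from the reproduction steps below, such a policy selects the NotReady machine first, which is the behavior this report expects from the NodePool controller.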
      

       

      Version-Release number of selected component (if applicable):

          OCP 4.19.21
          MCE 2.10
          HCP 4.19.20

      How reproducible:

          100%

       

      Steps to Reproduce:

      1. Create a HyperShift hosted cluster using KubeVirt provider with a NodePool of 4 replicas:
      
      oc get nodes --kubeconfig /tmp/kubeconfig
      NAME                        STATUS   ROLES    AGE     VERSION
      kubevirt-test-n8nwn-9s74h   Ready    worker   20m     v1.32.9
      kubevirt-test-n8nwn-njmk8   Ready    worker   4m55s   v1.32.9
      kubevirt-test-n8nwn-nvpm9   Ready    worker   6d22h   v1.32.9
      kubevirt-test-n8nwn-r2gnm   Ready    worker   26d     v1.32.9
      
      2. Stop kubelet on one of the nodes to simulate a NotReady condition:
      
      oc --kubeconfig /tmp/kubeconfig debug node/kubevirt-test-n8nwn-nvpm9
      chroot /host
      systemctl stop kubelet
      
      3. Verify the node transitions to NotReady:
      
      oc get nodes --kubeconfig /tmp/kubeconfig
      NAME                        STATUS     ROLES    AGE     VERSION
      kubevirt-test-n8nwn-9s74h   Ready      worker   20m     v1.32.9
      kubevirt-test-n8nwn-njmk8   Ready      worker   5m6s    v1.32.9
      kubevirt-test-n8nwn-nvpm9   NotReady   worker   6d22h   v1.32.9
      kubevirt-test-n8nwn-r2gnm   Ready      worker   26d     v1.32.9
      
      4. Scale down the NodePool from 4 to 3 replicas:
      oc scale --replicas 3 np/kubevirt-test -n clusters
      
      
      5. Observe which node/machine is selected for deletion.
      
      

      Actual results:

          The NodePool controller selected a healthy, Ready node (kubevirt-test-n8nwn-9s74h, age 21m) for deletion instead of the NotReady node (kubevirt-test-n8nwn-nvpm9):
      
      
      NAME                        CLUSTER         NODENAME                    PROVIDERID                             PHASE      AGE     VERSION
      kubevirt-test-n8nwn-9s74h   kubevirt-test   kubevirt-test-n8nwn-9s74h   kubevirt://kubevirt-test-n8nwn-9s74h   Deleting   28m     4.19.20
      
      Note: All Machines/VMIs were in Running state. The NotReady condition was at the Node level (kubelet stopped inside the guest VM), not at the Machine/VMI level.
      

      Expected results:

          The NodePool controller should prioritize deleting Machines whose corresponding Nodes are in NotReady state before deleting Machines with healthy Ready nodes. 
      
      This is consistent with the upstream CAPI MachineSet deletion priority, which orders unhealthy nodes ahead of healthy ones.
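Upstream CAPI additionally honors an explicit per-machine override, the `cluster.x-k8s.io/delete-machine` annotation, when ranking scale-down candidates. A health-aware policy would typically combine that override with node health; a hedged Go sketch (the annotation key is the real upstream CAPI one, but `candidate` and `priority` are hypothetical names, and the numeric weights are illustrative):

```go
package main

import "fmt"

// deleteMachineAnnotation is the upstream CAPI annotation that marks a
// machine as a preferred deletion candidate during scale-down.
const deleteMachineAnnotation = "cluster.x-k8s.io/delete-machine"

// candidate is a hypothetical, simplified machine view.
type candidate struct {
	name        string
	annotations map[string]string
	nodeReady   bool
}

// priority combines the explicit annotation with node health, the way a
// health-aware delete policy could rank scale-down victims.
func priority(c candidate) int {
	if _, ok := c.annotations[deleteMachineAnnotation]; ok {
		return 100 // operator explicitly asked for this machine to go first
	}
	if !c.nodeReady {
		return 50 // NotReady nodes go before healthy ones
	}
	return 0
}

func main() {
	fmt.Println(priority(candidate{name: "a", nodeReady: false})) // 50
	fmt.Println(priority(candidate{
		name:        "b",
		nodeReady:   true,
		annotations: map[string]string{deleteMachineAnnotation: "true"},
	})) // 100
}
```

If HyperShift's NodePool controller adopted a ranking along these lines, the scale-down in this report would have removed the NotReady node rather than a healthy one.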
      

      Additional info:

          I did not test this behavior with the HCP Agent provider, but if the same CAPI logic applies there, it will likely show the same behavior observed with HCP KubeVirt.

              Assignee: Unassigned
              Reporter: Divyam Pateriya (rhn-support-dpateriy)
              QA Contact: Yu Li
              Votes: 0
              Watchers: 3