Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: 4.19
Description of problem:
When scaling down a HyperShift NodePool (KubeVirt provider) by reducing the replica count, the controller does not take Node health (e.g. NotReady) into account when selecting which Machine to delete. A healthy (Ready) node is removed while the unhealthy (NotReady) node is left running in the cluster. Upstream CAPI MachineSet controllers appear to implement a deletion priority that favors removing unhealthy machines before healthy ones; HyperShift's NodePool controller does not appear to implement equivalent health-aware deletion logic.
Version-Release number of selected component (if applicable):
OCP 4.19.21
MCE 2.10
HCP 4.19.20
How reproducible:
100%
Steps to Reproduce:
1. Create a HyperShift hosted cluster using the KubeVirt provider with a NodePool of 4 replicas:

   oc get nodes --kubeconfig /tmp/kubeconfig
   NAME                        STATUS   ROLES    AGE     VERSION
   kubevirt-test-n8nwn-9s74h   Ready    worker   20m     v1.32.9
   kubevirt-test-n8nwn-njmk8   Ready    worker   4m55s   v1.32.9
   kubevirt-test-n8nwn-nvpm9   Ready    worker   6d22h   v1.32.9
   kubevirt-test-n8nwn-r2gnm   Ready    worker   26d     v1.32.9

2. Stop kubelet on one of the nodes to simulate a NotReady condition:

   oc --kubeconfig /tmp/kubeconfig debug node/kubevirt-test-n8nwn-nvpm9
   chroot /host
   systemctl stop kubelet

3. Verify the node transitions to NotReady:

   oc get nodes --kubeconfig /tmp/kubeconfig
   NAME                        STATUS     ROLES    AGE     VERSION
   kubevirt-test-n8nwn-9s74h   Ready      worker   20m     v1.32.9
   kubevirt-test-n8nwn-njmk8   Ready      worker   5m6s    v1.32.9
   kubevirt-test-n8nwn-nvpm9   NotReady   worker   6d22h   v1.32.9
   kubevirt-test-n8nwn-r2gnm   Ready      worker   26d     v1.32.9

4. Scale down the NodePool from 4 to 3 replicas:

   oc scale --replicas 3 np/kubevirt-test -n clusters

5. Observe which node/machine is selected for deletion.
Actual results:
The NodePool controller selected a *healthy Ready node* (kubevirt-test-n8nwn-9s74h, age 21m) for deletion instead of the NotReady node (kubevirt-test-n8nwn-nvpm9):

   kubevirt-test-n8nwn-9s74h   kubevirt-test   kubevirt-test-n8nwn-9s74h   kubevirt://kubevirt-test-n8nwn-9s74h   Deleting   28m   4.19.20

Note: All Machines/VMIs were in Running state. The NotReady condition was at the Node level (kubelet stopped inside the guest VM), not at the Machine/VMI level.
Expected results:
The NodePool controller should prioritize deleting Machines whose corresponding Nodes are NotReady before deleting Machines with healthy Ready nodes. This is consistent with upstream CAPI MachineSet deletion priority, which orders unhealthy machines ahead of healthy ones.
Additional info:
I did not test this behavior with the HCP Agent provider, but if it shares the same scale-down logic, it likely exhibits the same behavior as observed with HCP KubeVirt.