OCPBUGS-66890: CAPK does not drain guest node when VM is evicted from host cluster (and cannot live migrate)

    • Type: Bug
    • Priority: Undefined
    • Resolution: Unresolved
    • Affects Version/s: 4.18.z, 4.20.z
    • Severity: Important
    • CPU Architecture: x86_64
    • Sprint: CNV I/U Operators Sprint 281

      Description of problem:

      Create an HCP KubeVirt cluster with a few VMs that cannot live migrate (e.g. because they use local storage or a GPU) and with EvictionStrategy set to External. Then evict a VM that backs a node of the guest cluster and note that the guest node is not drained.

      Step by step:

      1. Stop the guest cluster VMs and set their eviction strategy to External, which is what OCPBUGS-58397 ("feat(KubeVirt): configure External evictionStrategy on VMs") will do.

      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      metadata:
        name: <vm-name>
      spec:
        template:
          spec:
            evictionStrategy: External
      
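      For illustration only, the same spec change can also be made from Go with controller-runtime and the kubevirt.io/api types. This is a minimal sketch, not what OCPBUGS-58397 actually implements; the namespace is simply the one from this report, and blindly updating every VM in it is an assumption for the example:

      package main

      import (
      	"context"
      	"fmt"

      	"k8s.io/apimachinery/pkg/runtime"
      	kubevirtv1 "kubevirt.io/api/core/v1"
      	"sigs.k8s.io/controller-runtime/pkg/client"
      	"sigs.k8s.io/controller-runtime/pkg/client/config"
      )

      func main() {
      	scheme := runtime.NewScheme()
      	if err := kubevirtv1.AddToScheme(scheme); err != nil {
      		panic(err)
      	}
      	c, err := client.New(config.GetConfigOrDie(), client.Options{Scheme: scheme})
      	if err != nil {
      		panic(err)
      	}

      	ctx := context.Background()
      	var vms kubevirtv1.VirtualMachineList
      	// clusters-hostedcluster-420 is the HCP namespace used in this report.
      	if err := c.List(ctx, &vms, client.InNamespace("clusters-hostedcluster-420")); err != nil {
      		panic(err)
      	}

      	external := kubevirtv1.EvictionStrategyExternal
      	for i := range vms.Items {
      		vm := &vms.Items[i]
      		if vm.Spec.Template == nil {
      			continue
      		}
      		vm.Spec.Template.Spec.EvictionStrategy = &external
      		if err := c.Update(ctx, vm); err != nil {
      			panic(err)
      		}
      		fmt.Println("set evictionStrategy: External on", vm.Name)
      	}
      }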

      2. Bring those VMs up again and ensure everything is fine.

      [HOST] # oc get vmi
      NAME                            AGE   PHASE     IP             NODENAME               READY
      hostedcluster-420-9cqbx-jzq4g   28m   Running   10.129.4.100   cyan.shift.home.arpa   True
      hostedcluster-420-9cqbx-lb2k4   28m   Running   10.129.4.101   cyan.shift.home.arpa   True
      hostedcluster-420-9cqbx-qm6z7   23m   Running   10.129.4.103   cyan.shift.home.arpa   True
      
      [GUEST] # oc get nodes 
      NAME                            STATUS   ROLES    AGE   VERSION
      hostedcluster-420-9cqbx-jzq4g   Ready    worker   72m   v1.33.5
      hostedcluster-420-9cqbx-lb2k4   Ready    worker   65m   v1.33.5
      hostedcluster-420-9cqbx-qm6z7   Ready    worker   72m   v1.33.5
      

      3. Next, let's evict one of the VMs; I'll pick the last one, hostedcluster-420-9cqbx-qm6z7:

      [HOST] # oc adm drain cyan.shift.home.arpa --pod-selector='kubevirt.io/vm=hostedcluster-420-9cqbx-qm6z7' --delete-emptydir-data
      
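      For reference, per pod the drain above boils down to an eviction request. Below is a client-go sketch of that call against the virt-launcher pod named in the step 4 logs, purely as an illustration of what the drain issues:

      package main

      import (
      	"context"
      	"fmt"

      	policyv1 "k8s.io/api/policy/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      	"k8s.io/client-go/kubernetes"
      	"sigs.k8s.io/controller-runtime/pkg/client/config"
      )

      func main() {
      	// Host-cluster client from the usual kubeconfig/in-cluster config.
      	kubeClient := kubernetes.NewForConfigOrDie(config.GetConfigOrDie())

      	eviction := &policyv1.Eviction{
      		ObjectMeta: metav1.ObjectMeta{
      			Name:      "virt-launcher-hostedcluster-420-9cqbx-qm6z7-tdfb9",
      			Namespace: "clusters-hostedcluster-420",
      		},
      	}
      	// With evictionStrategy External, KubeVirt's eviction webhook denies this
      	// request and flags the VMI for external evacuation (see step 4).
      	err := kubeClient.PolicyV1().Evictions(eviction.Namespace).Evict(context.Background(), eviction)
      	fmt.Println("eviction result:", err)
      }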

      4. CAPK puts the annotation in and the eviction gets blocked for a while

      I1205 02:40:56.487949       1 machine.go:510] "msg"="setting the capk.cluster.x-k8s.io/vmi-deletion-grace-time annotation" "KubevirtMachine"={"name":"hostedcluster-420-9cqbx-qm6z7","namespace":"clusters-hostedcluster-420"} "controller"="kubevirtmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="KubevirtMachine" "logger"="clusters-hostedcluster-420.hostedcluster-420-9cqbx-qm6z7" "name"="hostedcluster-420-9cqbx-qm6z7" "namespace"="clusters-hostedcluster-420" "reconcileID"="3e3d400d-d535-447d-ad37-e6446af549f1"
      
      ...
      
      error when evicting pods/"virt-launcher-hostedcluster-420-9cqbx-qm6z7-tdfb9" -n "clusters-hostedcluster-420" (will retry after 5s): admission webhook "virt-launcher-eviction-interceptor.kubevirt.io" denied the request: Eviction triggered evacuation of VMI "clusters-hostedcluster-420/hostedcluster-420-9cqbx-qm6z7"
      ...
      
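      That denial is, as I understand it, the External-eviction hand-off: KubeVirt refuses to shut the VMI down itself and only flags it via status.evacuationNodeName, leaving the drain and shutdown to an external controller (CAPK here). A standalone Go sketch that just waits for that flag, with the namespace and VMI name taken from this report:

      package main

      import (
      	"context"
      	"fmt"
      	"time"

      	"k8s.io/apimachinery/pkg/runtime"
      	kubevirtv1 "kubevirt.io/api/core/v1"
      	"sigs.k8s.io/controller-runtime/pkg/client"
      	"sigs.k8s.io/controller-runtime/pkg/client/config"
      )

      func main() {
      	scheme := runtime.NewScheme()
      	if err := kubevirtv1.AddToScheme(scheme); err != nil {
      		panic(err)
      	}
      	c, err := client.New(config.GetConfigOrDie(), client.Options{Scheme: scheme})
      	if err != nil {
      		panic(err)
      	}

      	key := client.ObjectKey{Namespace: "clusters-hostedcluster-420", Name: "hostedcluster-420-9cqbx-qm6z7"}
      	for i := 0; i < 60; i++ {
      		var vmi kubevirtv1.VirtualMachineInstance
      		if err := c.Get(context.Background(), key, &vmi); err != nil {
      			panic(err)
      		}
      		if vmi.Status.EvacuationNodeName != "" {
      			// Note: this names the *host* node being evacuated, not the guest node.
      			fmt.Println("evacuation requested from host node:", vmi.Status.EvacuationNodeName)
      			return
      		}
      		time.Sleep(5 * time.Second)
      	}
      	fmt.Println("no evacuation requested")
      }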

      5. But then CAPK fails to find the guest node it needs to drain

      E1205 02:40:56.496122       1 machine.go:539] "msg"="Could not find node from noderef, it may have already been deleted" "error"="nodes \"cyan.shift.home.arpa\" not found" "KubevirtMachine"={"name":"hostedcluster-420-9cqbx-qm6z7","namespace":"clusters-hostedcluster-420"} "controller"="kubevirtmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="KubevirtMachine" "logger"="clusters-hostedcluster-420.hostedcluster-420-9cqbx-qm6z7" "name"="hostedcluster-420-9cqbx-qm6z7" "namespace"="clusters-hostedcluster-420" "reconcileID"="3e3d400d-d535-447d-ad37-e6446af549f1"
      

      6. And gives up. The VM now shuts down without a drain as soon as the timeout above expires.

      I1205 02:40:56.532994       1 machine.go:470] "msg"="DrainNode: the virtualMachineInstance is already in deletion process. Nothing to do here" "KubevirtMachine"={"name":"hostedcluster-420-9cqbx-qm6z7","namespace":"clusters-hostedcluster-420"} "controller"="kubevirtmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="KubevirtMachine" "logger"="clusters-hostedcluster-420.hostedcluster-420-9cqbx-qm6z7" "name"="hostedcluster-420-9cqbx-qm6z7" "namespace"="clusters-hostedcluster-420" "reconcileID"="ee985ad7-1a96-438a-864b-1835b848bb29"
      

      The error in step 5 is in this part of the code:

      func (m *Machine) drainNode(wrkldClstr workloadcluster.WorkloadCluster) (time.Duration, error) {

      	// .....

      	nodeName := m.vmiInstance.Status.EvacuationNodeName // <---- why? this is the host node, not the guest node we want to drain
      	node, err := kubeClient.CoreV1().Nodes().Get(m.machineContext, nodeName, metav1.GetOptions{})
      	if err != nil {
      		if apierrors.IsNotFound(err) {
      			// If an admin deletes the node directly, we'll end up here.
      			m.machineContext.Logger.Error(err, "Could not find node from noderef, it may have already been deleted") // <------ we get here
      			return 0, nil
      		}
      		return 0, fmt.Errorf("unable to get node %q: %w", nodeName, err)
      	}
      

      Link: https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/blob/b1ad7eddf047dcdde80f46d9cdaece523a15c6a2/pkg/kubevirt/machine.go#L535

      It seems to be looking for:

      nodeName := m.vmiInstance.Status.EvacuationNodeName 

      But this is the host node name, not the guest node name (which is what it wants to drain), so it doesn't find that host node in the guest cluster.

      Look at this; it's the host node:

      [HOST] # oc adm drain cyan.shift.home.arpa --pod-selector='kubevirt.io/vm=hostedcluster-420-9cqbx-qm6z7' --delete-emptydir-data
      [HOST] # oc get vmi hostedcluster-420-9cqbx-qm6z7 -o yaml | yq '.status.evacuationNodeName'
      cyan.shift.home.arpa 

      Shouldn't it be looking for the guest node name there to trigger the drain?
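      For comparison, here is a minimal sketch of the lookup I would expect instead, assuming (as in this environment) that the guest node name equals the VMI/KubevirtMachine name. The helper name and the guest-kubeconfig handling are made up for the sketch; the real change would live in drainNode() in pkg/kubevirt/machine.go:

      package main

      import (
      	"context"
      	"fmt"

      	apierrors "k8s.io/apimachinery/pkg/api/errors"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      	"k8s.io/client-go/kubernetes"
      	"k8s.io/client-go/tools/clientcmd"
      )

      // findGuestNodeToDrain looks the node up in the guest cluster by the VMI name
      // instead of by status.evacuationNodeName (which names the host node).
      func findGuestNodeToDrain(ctx context.Context, guestClient kubernetes.Interface, vmiName string) (string, error) {
      	node, err := guestClient.CoreV1().Nodes().Get(ctx, vmiName, metav1.GetOptions{})
      	if apierrors.IsNotFound(err) {
      		// Node already gone; nothing to drain.
      		return "", nil
      	}
      	if err != nil {
      		return "", fmt.Errorf("unable to get node %q: %w", vmiName, err)
      	}
      	return node.Name, nil
      }

      func main() {
      	// Guest-cluster kubeconfig path is an assumption for this standalone sketch.
      	cfg, err := clientcmd.BuildConfigFromFlags("", "guest-kubeconfig")
      	if err != nil {
      		panic(err)
      	}
      	guestClient := kubernetes.NewForConfigOrDie(cfg)

      	name, err := findGuestNodeToDrain(context.Background(), guestClient, "hostedcluster-420-9cqbx-qm6z7")
      	if err != nil {
      		panic(err)
      	}
      	fmt.Println("guest node to drain:", name)
      }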

      Version-Release number of selected component (if applicable):

      OCP 4.20.4 (both clusters)
      CNV 4.20.1
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      As above
      

      Actual results:

      Guest node is not drained
      

      Expected results:

      Guest node drains
      

        Assignee: Nahshon Unna Tsameret (nunnatsa)
        Reporter: Germano Veit Michel (rhn-support-gveitmic)
        QA Contact: Ying Zhou