Bug
Resolution: Unresolved
CNV v4.20.0
Quality / Stability / Reliability
Important
Description of problem:
virt-handler is hot-looping detecting the completion of a migration. See:

{"component":"virt-handler","kind":"","level":"info","msg":"The target node detected that the migration has completed","name":"rhel9-2423","namespace":"vm-ns-1","pos":"migration-target.go:213","timestamp":"2025-10-14T15:27:07.546287Z","uid":"19ec4892-c15b-4f90-a331-4f5bdb811259"}
{"component":"virt-handler","kind":"","level":"info","msg":"The target node detected that the migration has completed","name":"rhel9-2423","namespace":"vm-ns-1","pos":"migration-target.go:213","timestamp":"2025-10-14T15:27:07.562381Z","uid":"19ec4892-c15b-4f90-a331-4f5bdb811259"}
{"component":"virt-handler","kind":"","level":"info","msg":"The target node detected that the migration has completed","name":"rhel9-2423","namespace":"vm-ns-1","pos":"migration-target.go:213","timestamp":"2025-10-14T15:27:07.619072Z","uid":"19ec4892-c15b-4f90-a331-4f5bdb811259"}
{"component":"virt-handler","kind":"","level":"info","msg":"The target node detected that the migration has completed","name":"rhel9-2423","namespace":"vm-ns-1","pos":"migration-target.go:213","timestamp":"2025-10-14T15:27:07.634057Z","uid":"19ec4892-c15b-4f90-a331-4f5bdb811259"}

The message count keeps growing across successive checks:

[root@e44-h32-000-r650 ~]# oc logs -n openshift-cnv virt-handler-fz7cl -c virt-handler | grep "The target node detected that the migration has completed" | grep "rhel9-2423" | wc -l
1811
...
[root@e44-h32-000-r650 ~]# oc logs -n openshift-cnv virt-handler-fz7cl -c virt-handler | grep "The target node detected that the migration has completed" | grep "rhel9-2423" | wc -l
2177
...
[root@e44-h32-000-r650 ~]# oc logs -n openshift-cnv virt-handler-fz7cl -c virt-handler | grep "The target node detected that the migration has completed" | grep "rhel9-2423" | wc -l
3606

node-drain-controller is trying to evict the virt-launcher pod on that node as part of a node drain. The admission webhook correctly sets status.evacuationNodeName on the VMI object, but the same code block in virt-handler that logs "The target node detected that the migration has completed" also immediately removes Status.EvacuationNodeName (see: https://github.com/kubevirt/kubevirt/blob/0c17b3287760e90ef012d2536818ea3fbf2c5069/pkg/virt-handler/migration-target.go#L200-L214 ), so a new migration in response to the eviction request is never executed and the node drain is stuck.
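For illustration only, a minimal, self-contained Go sketch of the suspected pattern (hypothetical names throughout; the actual logic is in the migration-target.go lines linked above): once the old migration is marked completed, the completion branch fires on every reconcile of the same VMI, so it re-logs the message and keeps wiping the evacuation marker the eviction webhook just set.

package main

import "fmt"

// Hypothetical, simplified stand-ins for the VMI status fields involved.
type vmiStatus struct {
	migrationCompleted bool   // stays true forever once the old migration finished
	evacuationNodeName string // set by the eviction webhook to request a new migration
}

// syncTarget sketches the suspected shape of the target-side handler: the
// "migration has completed" branch is re-entered on every reconcile, so it
// re-logs the message (the hot loop seen in the logs above) and clears
// evacuationNodeName, erasing the eviction request before the migration
// controller can act on it.
func syncTarget(s *vmiStatus) {
	if s.migrationCompleted {
		fmt.Println("The target node detected that the migration has completed")
		s.evacuationNodeName = "" // also removes a marker set *after* completion
	}
}

func main() {
	s := &vmiStatus{migrationCompleted: true}
	for i := 0; i < 3; i++ { // each iteration models one reconcile of the same VMI
		s.evacuationNodeName = "worker-1" // node-drain eviction re-sets the marker
		syncTarget(s)                     // ...and the handler immediately clears it
		fmt.Printf("evacuationNodeName after sync: %q\n", s.evacuationNodeName)
	}
}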
Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.4.20.0-220
How reproducible:
Unknown so far; encountered only once, during a large-scale upgrade test with 10k VMIs.
Steps to Reproduce:
TBD; exact reproduction steps are still unclear.
Actual results:
An old migration to the node actually succeeded: the VM really is running there, the old source pod is gone, and the target pod is correctly present. Yet something is not correctly aligned (where?): virt-handler is hot-looping on detecting the end of that migration and keeps removing status.evacuationNodeName, which blocks any future eviction. As a result, the drain of the node is stuck.
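One way to observe this from the outside (the command below is illustrative, not from the original report; it uses standard oc JSONPath output against the real status.evacuationNodeName field):

[root@e44-h32-000-r650 ~]# oc get vmi rhel9-2423 -n vm-ns-1 -o jsonpath='{.status.evacuationNodeName}'

If virt-handler keeps clearing the field, this should print an empty value even while the drain controller is still retrying the eviction, while the count of "The target node detected that the migration has completed" messages in the virt-handler log keeps growing.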
Expected results:
virt-handler detects, once, that the old migration successfully completed. A new migration can then be triggered in response to the eviction request, and the node can be successfully drained.
Additional info:
https://github.com/kubevirt/kubevirt/pull/15721 is a related change in this area; its relevance is TBD.