-
Bug
-
Resolution: Done-Errata
-
Normal
-
None
-
Quality / Stability / Reliability
-
5
-
False
-
-
False
-
CLOSED
-
-
-
CNV Virtualization Sprint 239
-
Important
-
No
Description of problem:
If a virtual machine migration is canceled before the virt-launcher detects the qemu-kvm process pid, the target virt-launcher is not cleaned up immediately and instead waits for the qemu-timeout to expire.
It waits in the monitor's refresh loop here: https://github.com/kubevirt/kubevirt/blob/f77d50591ddd0f74c0c876e38fdf14ca3fe54be8/pkg/virt-launcher/monitor.go#L126.
Since the virt-launcher has not found the pid yet, mon.pid is always 0 and mon.isDone stays false.
Migration was canceled here:
~~~
{"component":"virt-launcher","kind":"","level":"info","msg":"Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup","name":"rhel7-quick-halibut","namespace":"default","pos":"server.go:151","timestamp":"2023-01-16T08:31:05.730454Z","uid":"64f6bc95-0b0d-4cb2-b954-69318cc409a3"}
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 76 with status 0","pos":"virt-launcher-monitor.go:125","timestamp":"2023-01-16T08:31:05.983324Z"}
{"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:5894","subcomponent":"libvirt","thread":"26","timestamp":"2023-01-16T08:31:06.070000Z"}
~~~
It then keeps waiting for the qemu pid and finally times out after the qemu-timeout, which here is 5m11s:
~~~
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:06.420909Z"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:07.421195Z"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:29.420918Z"}
.....
.....
.....
~~~
Although I can also see the message "Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup", it doesn't seem to have any effect here, because it only sets receivedEarlyExitSignalEnvVar, which is queried only in waitForDomainUUID, and that runs before the refresh monitor.
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.11.2
How reproducible:
100 %
Steps to Reproduce:
1. Start a virtual machine migration.
2. Cancel the VM migration. The cancel has to happen before the virt-launcher detects the qemu pid. I was able to reproduce this easily by cancelling the migration immediately after the target pod was scheduled.
Actual results:
The target pod waits for "qemu-timeout" to expire before cleaning up after the VM live migration is cancelled.
Expected results:
Since the user is cancelling the migration, the resources created for the migration are expected to be terminated immediately instead of waiting for a timeout to expire.
Additional info: