
Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: CNV v4.14.0
Affects Version/s: None
Component/s: CNV Virtualization
Labels:
- cnv-4+
- cnvbugsm
- devel_ack+
- pm_ack+
- qa_ack+
- qe_test_coverage?

Activity Type:
Quality / Stability / Reliability
Story Points:
5
Blocked:
False
Blocked Reason:

None
Ready:
False
BZ Status:
CLOSED
BZ URL:
https://bugzilla.redhat.com/show_bug.cgi?id=2161184
Bugzilla Bug:
RHBZ: 2161184
Intelligence Requested:
Market:

Sprint:
CNV Virtualization Sprint 239
Severity:
Important

Regression:
No

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

If a virtual machine migration is canceled before the virt-launcher detects the qemu-kvm process pid, the target virt-launcher is not cleaned up immediately. Instead, it waits for qemu-timeout to expire.

It waits in the monitor's refresh loop here: https://github.com/kubevirt/kubevirt/blob/f77d50591ddd0f74c0c876e38fdf14ca3fe54be8/pkg/virt-launcher/monitor.go#L126.

Because the virt-launcher has not found the pid yet, mon.pid is always 0 and mon.isDone stays false.

Migration was canceled here:

~~~

{"component":"virt-launcher","kind":"","level":"info","msg":"Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup","name":"rhel7-quick-halibut","namespace":"default","pos":"server.go:151","timestamp":"2023-01-16T08:31:05.730454Z","uid":"64f6bc95-0b0d-4cb2-b954-69318cc409a3"}
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 76 with status 0","pos":"virt-launcher-monitor.go:125","timestamp":"2023-01-16T08:31:05.983324Z"}
{"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:5894","subcomponent":"libvirt","thread":"26","timestamp":"2023-01-16T08:31:06.070000Z"}

~~~

It then keeps waiting for the qemu pid and finally times out after qemu-timeout, which here is 5m11s:

~~~

{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:06.420909Z"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:07.421195Z"}
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:29.420918Z"}

.....
.....
.....

{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:36:16.421068Z"}
{"component":"virt-launcher","level":"info","msg":"default_rhel7-quick-halibut not found after timeout","pos":"monitor.go:129","timestamp":"2023-01-16T08:36:16.421119Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:270","timestamp":"2023-01-16T08:36:16.421153Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:501","timestamp":"2023-01-16T08:36:16.422034Z"}

~~~

Although the log also shows the message "Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup", it does not seem to have any effect here. The signal sets receivedEarlyExitSignalEnvVar, which is only queried in waitForDomainUUID, and that runs before the refresh monitor.

Version-Release number of selected component (if applicable):

OpenShift Virtualization 4.11.2

How reproducible:

100%

Steps to Reproduce:

1. Start a virtual machine migration.
2. Cancel the VM migration before the virt-launcher detects the qemu pid. I was able to reproduce this easily by canceling the migration immediately after the target pod was scheduled.

Actual results:

After the VM live migration is canceled, the target pod waits for "qemu-timeout" to expire before it is cleaned up.

Expected results:

Since the user is canceling the migration, the resources created for the migration should be terminated immediately instead of waiting for a timeout to hit.

Additional info:

external trackers

CEE GitLab cpaas-midstream/openshift-virtualization/kubevirt/merge_requests/2191

Red Hat Customer Portal 03407272

Red Hat Errata Tool 113931

Red Hat Issue Tracker CNV-24341

Red Hat Product Errata RHSA-2023:6817

links to

PR
