OpenShift Virtualization / CNV-24341

[2161184] Target pod waits for "qemu-timeout" to cleanup after cancelling the VM live migration


    • CNV Virtualization Sprint 239
    • Important
    • No

      Description of problem:

      If the virtual machine migration is canceled before the virt-launcher detects the qemu-kvm process pid, the target virt-launcher is not cleaned up immediately and waits for the qemu-timeout.

      It waits in the monitor's refresh loop here: https://github.com/kubevirt/kubevirt/blob/f77d50591ddd0f74c0c876e38fdf14ca3fe54be8/pkg/virt-launcher/monitor.go#L126

      Since the virt-launcher has not found the pid yet, mon.pid will always be 0 and mon.isDone will stay false until the timeout expires.

      Migration was canceled here:

      ~~~

      {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup","name":"rhel7-quick-halibut","namespace":"default","pos":"server.go:151","timestamp":"2023-01-16T08:31:05.730454Z","uid":"64f6bc95-0b0d-4cb2-b954-69318cc409a3"}
      {"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 76 with status 0","pos":"virt-launcher-monitor.go:125","timestamp":"2023-01-16T08:31:05.983324Z"}
      {"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:5894","subcomponent":"libvirt","thread":"26","timestamp":"2023-01-16T08:31:06.070000Z"}

      ~~~

      It then keeps waiting for the qemu pid and finally times out after qemu-timeout, which here is 5m11s:

      ~~~

      {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:06.420909Z"}
      {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:07.421195Z"}
      {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:29.420918Z"}

      .....
      .....
      .....

      {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:36:16.421068Z"}
      {"component":"virt-launcher","level":"info","msg":"default_rhel7-quick-halibut not found after timeout","pos":"monitor.go:129","timestamp":"2023-01-16T08:36:16.421119Z"}
      {"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:270","timestamp":"2023-01-16T08:36:16.421153Z"}
      {"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:501","timestamp":"2023-01-16T08:36:16.422034Z"}

      ~~~

      Although the message "Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup" is logged, it does not seem to have any effect here. The signal only sets receivedEarlyExitSignalEnvVar, and that variable is queried only in waitForDomainUUID, which runs before the refresh monitor.

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization 4.11.2

      How reproducible:

      100%

      Steps to Reproduce:

      1. Start a virtual machine migration.
      2. Cancel the VM migration. The cancellation must happen before the virt-launcher detects the qemu pid. I was able to reproduce this easily by canceling the migration immediately after the target pod was scheduled.

      Actual results:

      Target pod waits for "qemu-timeout" to cleanup after cancelling the VM live migration

      Expected results:

      Since the user is canceling the migration, it is expected to immediately terminate the resources created for the migration instead of waiting for a timeout to hit.

      Additional info:

              iholder@redhat.com Itamar Holder
              rhn-support-nashok Nijin Ashok