OpenShift Virtualization / CNV-63721

Lost track of migration when cancelled during pre-copy "switchover" phase


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: CNV v4.21.0, CNV v4.18.3
    • Component/s: CNV Virt-Node
    • Category: Quality / Stability / Reliability
    • Severity: Moderate

      Description of problem:

      
      Customer is testing migrations by creating and deleting them in a loop. Cancelling a migration at a very specific point in time - the switchover phase at the end of pre-copy, where the VM is briefly suspended to copy the remaining pages - causes the system to lose track of the migration: it pretends to cancel it (the VMIM is deleted and the logs indicate it was cancelled), but the libvirt job is left "running" indefinitely, blocking further migrations.
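
      KubeVirt's own view looks clean once the bug triggers. A minimal check of
      what the cluster reports after the cancel (a sketch; the namespace and VM
      name are placeholders, and status.migrationState is the VMI field the
      abort outcome is recorded in):

      NS=default
      VM=rhel-94

      # The VMIM is gone and the VMI records the abort as finished, even
      # though libvirt (checked under "Actual results") still runs the job.
      kubectl get vmim -n "$NS"
      kubectl get vmi "$VM" -n "$NS" \
        -o jsonpath='{.status.migrationState.abortStatus}{"\n"}'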
      
      

      Version-Release number of selected component (if applicable):

      4.18.3
      

      How reproducible:

      Takes a while, but it does reproduce. The sleep may need tuning so that the cancel lands just after destination pod preparation, near the switchover window.
      
      $ cat migrate.sh 
      #!/bin/bash
      
      # Repeatedly migrate the "rhel-94" VM and cancel shortly afterwards.
      # The 7-second sleep is tuned so the cancel tends to land late in
      # pre-copy; adjust it for your environment.
      while true; do
      	virtctl migrate rhel-94
      	sleep 7
      	virtctl migrate-cancel rhel-94
      done
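
      To see which phase each attempt reaches before the cancel, the migration
      objects can be watched in a second terminal while the loop runs (a small
      sketch; add a namespace flag as needed):

      # The PHASE column shows how far each VMIM gets before it is deleted.
      kubectl get vmim -w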
      
      

      Steps to Reproduce:

      1. Create and cancel migrations in a loop (see migrate.sh above)
      

      Actual results:

      * The migration job is left running on the libvirt side (confirmed below)
      * The VM cannot be migrated again
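
      The leftover job can be confirmed directly from libvirt inside the
      virt-launcher pod. A sketch, assuming hypothetical pod/namespace names
      (the libvirt domain name is <namespace>_<vm-name>, as in the logs below):

      NS=kubevirt-0-1749496987
      POD=virt-launcher-fio-vm-multi-disk-xxxxx   # hypothetical; look it up

      # domjobinfo still reports an active outgoing migration job long after
      # the VMIM was deleted and the abort was logged as succeeded.
      kubectl exec -n "$NS" "$POD" -c compute -- \
        virsh domjobinfo "${NS}_fio-vm-multi-disk"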
      

      Expected results:

      * On cancellation, clean up and also abort the libvirt migration job (a manual workaround is sketched below)
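
      Until that happens, a possible manual workaround (an assumption, not
      something verified in this report) is to abort the orphaned job directly
      via libvirt, with the same hypothetical pod/domain names as above:

      # Clears the stuck migration job so new migrations can start again.
      kubectl exec -n "$NS" "$POD" -c compute -- \
        virsh domjobabort "${NS}_fio-vm-multi-disk"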
      

      Additional info:

      1. Migration initiated
      {"component":"virt-launcher","kind":"","level":"info","msg":"Initiating live migration.","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:1003","timestamp":"2025-06-09T19:51:37.945455Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      
      {"component":"virt-launcher","kind":"","level":"info","msg":"generated migration parameters: [....]
      
      2. Immediately cancel:
      {"component":"virt-launcher","kind":"","level":"info","msg":"Live migration as been aborted","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"server.go:136","timestamp":"2025-06-09T19:51:40.071392Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"} 
      
      {"component":"virt-launcher","kind":"","level":"info","msg":"Live migration as been aborted","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"server.go:136","timestamp":"2025-06-09T19:51:40.077091Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      
      3. An event arrives indicating we have entered the switchover phase:
      {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event Domain event=\"suspended\" detail=\"migrated\" with event id 3 reason 1 received","pos":"client.go:481","timestamp":"2025-06-09T19:51:40.170293Z"}
      
      4. Migration is aborted after entering switchover
      {"component":"virt-launcher","kind":"","level":"info","msg":"Live migration abort succeeded","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:718","timestamp":"2025-06-09T19:51:40.173127Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      
      5. The VM is resumed on the source:
      {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event Domain event=\"resumed\" detail=\"unpaused\" with event id 4 reason 0 received","pos":"client.go:481","timestamp":"2025-06-09T19:51:40.173268Z"}
      
      {"component":"virt-launcher","level":"info","msg":"Domain name event: kubevirt-0-1749496987_fio-vm-multi-disk","pos":"client.go:435","timestamp":"2025-06-09T19:51:40.176666Z"}
      {"component":"virt-launcher","level":"info","msg":"Domain name event: kubevirt-0-1749496987_fio-vm-multi-disk","pos":"client.go:435","timestamp":"2025-06-09T19:51:40.181104Z"}
      
      6. Why are we getting migration statistics if it was cancelled?
      {"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 2854a6d7-c569-431e-bb45-cb41f66caa3d: TimeElapsed:4003ms DataProcessed:1MiB DataRemaining:0MiB DataTotal:1MiB MemoryProcessed:0MiB MemoryRemaining:0MiB MemoryTotal:0MiB MemoryBandwidth:0Mbps DirtyRate:0Mbps Iteration:0 PostcopyRequests:0 ConstantPages:0 NormalPages:0 NormalData:0MiB ExpectedDowntime:0ms DiskMbps:0","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:685","timestamp":"2025-06-09T19:51:41.962866Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      ...
      {"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 2854a6d7-c569-431e-bb45-cb41f66caa3d: TimeElapsed:747689ms DataProcessed:1MiB DataRemaining:0MiB DataTotal:1MiB MemoryProcessed:0MiB MemoryRemaining:0MiB MemoryTotal:0MiB MemoryBandwidth:0Mbps DirtyRate:0Mbps Iteration:0 PostcopyRequests:0 ConstantPages:0 NormalPages:0 NormalData:0MiB ExpectedDowntime:0ms DiskMbps:0","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:685","timestamp":"2025-06-09T20:04:05.649309Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      
      7. Times out on the virt-launcher side (but wasn't it cancelled?):
      
      {"component":"virt-launcher","kind":"","level":"error","msg":"Live migration abort detected with reason: Live migration is not completed after 750 seconds and has been aborted","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:649","timestamp":"2025-06-09T20:04:09.270034Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      
      ... 2 days later ...
      
      8. Trying again, we find the libvirt job was never cancelled by the "abort" above:
      
      {"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:1016","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=55, Domain=10, Message='Requested operation is not valid: another migration job is already running for domain 'kubevirt-0-1749496987_fio-vm-multi-disk'')","timestamp":"2025-06-11T12:12:43.508933Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
      

              Assignee: Jed Lejosne (jelejosne)
              Reporter: Germano Veit Michel (rhn-support-gveitmic)
              Denys Shchedrivyi