Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: CNV v4.18.3
Component: Quality / Stability / Reliability
Severity: Moderate
Description of problem:
The customer is testing migrations by creating and cancelling them in a loop. When a migration is cancelled at a very specific point in time - the switchover phase at the end of precopy, where the VM is briefly suspended to copy the remaining pages - the system loses track of the migration: it appears to cancel it (the VMIM is deleted and the logs report the cancellation), but the libvirt job is left "running" indefinitely, blocking all further migrations.
Version-Release number of selected component (if applicable):
4.18.3
How reproducible:
Takes a while but reproduces; the sleep may need tuning to allow for destination pod preparation.

$ cat migrate.sh
#!/bin/bash
while true; do
  virtctl migrate rhel-94
  sleep 7
  virtctl migrate-cancel rhel-94
done
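Once the reproducer loop has hit the race, the leftover libvirt job can be confirmed from inside the source virt-launcher pod. The sketch below is illustrative, not part of KubeVirt: the pod name is a placeholder, the namespace/domain names are taken from this report's logs, and `virsh domjobinfo` is the standard libvirt command that reports any active domain job ("Job type: None" when idle).

```shell
#!/bin/bash
# Illustrative check for a leftover migration job. The kubectl/virsh
# invocation is a sketch; <virt-launcher-pod> is a placeholder:
#
#   kubectl exec -n kubevirt-0-1749496987 <virt-launcher-pod> -c compute -- \
#       virsh domjobinfo kubevirt-0-1749496987_fio-vm-multi-disk
#
# Helper that interprets domjobinfo output: returns 0 (true) when libvirt
# reports any job type other than "None", i.e. it still tracks a migration.
has_active_job() {
  ! grep -q '^Job type:[[:space:]]*None' <<< "$1"
}
```

In the state described by this bug, `domjobinfo` would keep reporting an active job long after the VMIM was deleted.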
Steps to Reproduce:
1. Create and cancel migrations in a loop, cancelling shortly after each migration starts (see migrate.sh above)
Actual results:
* Migration is left running on the libvirt side
* The VM cannot be migrated again
Expected results:
* Cancelling the migration should also abort the libvirt job and clean up
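Until the cleanup behaves as expected, a possible manual workaround (untested against this specific bug) is to abort the leftover job by hand with `virsh domjobabort`, the standard libvirt command for cancelling an active domain job. The sketch below only assembles the command; the pod name is a placeholder and should be verified in your environment first.

```shell
#!/bin/bash
# Sketch of a manual workaround: abort the leftover libvirt job from inside
# the source virt-launcher pod. Illustrative only; <virt-launcher-pod> is a
# placeholder, other values are taken from this report's logs.
build_abort_cmd() {
  local ns="$1" pod="$2" domain="$3"
  echo "kubectl exec -n ${ns} ${pod} -c compute -- virsh domjobabort ${domain}"
}

# Values from this report:
# build_abort_cmd kubevirt-0-1749496987 <virt-launcher-pod> \
#     kubevirt-0-1749496987_fio-vm-multi-disk
```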
Additional info:
1. Migration initiated:
{"component":"virt-launcher","kind":"","level":"info","msg":"Initiating live migration.","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:1003","timestamp":"2025-06-09T19:51:37.945455Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
{"component":"virt-launcher","kind":"","level":"info","msg":"generated migration parameters: [....]

2. Immediately cancel:
{"component":"virt-launcher","kind":"","level":"info","msg":"Live migration as been aborted","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"server.go:136","timestamp":"2025-06-09T19:51:40.071392Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Live migration as been aborted","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"server.go:136","timestamp":"2025-06-09T19:51:40.077091Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}

3. An event arrives indicating we have entered the switchover phase:
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event Domain event=\"suspended\" detail=\"migrated\" with event id 3 reason 1 received","pos":"client.go:481","timestamp":"2025-06-09T19:51:40.170293Z"}

4. Migration is aborted after entering switchover:
{"component":"virt-launcher","kind":"","level":"info","msg":"Live migration abort succeeded","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:718","timestamp":"2025-06-09T19:51:40.173127Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}

5. The VM is resumed again on the source:
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event Domain event=\"resumed\" detail=\"unpaused\" with event id 4 reason 0 received","pos":"client.go:481","timestamp":"2025-06-09T19:51:40.173268Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: kubevirt-0-1749496987_fio-vm-multi-disk","pos":"client.go:435","timestamp":"2025-06-09T19:51:40.176666Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: kubevirt-0-1749496987_fio-vm-multi-disk","pos":"client.go:435","timestamp":"2025-06-09T19:51:40.181104Z"}

6. Why are we still getting migration statistics if the migration was cancelled?
{"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 2854a6d7-c569-431e-bb45-cb41f66caa3d: TimeElapsed:4003ms DataProcessed:1MiB DataRemaining:0MiB DataTotal:1MiB MemoryProcessed:0MiB MemoryRemaining:0MiB MemoryTotal:0MiB MemoryBandwidth:0Mbps DirtyRate:0Mbps Iteration:0 PostcopyRequests:0 ConstantPages:0 NormalPages:0 NormalData:0MiB ExpectedDowntime:0ms DiskMbps:0","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:685","timestamp":"2025-06-09T19:51:41.962866Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
...
{"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 2854a6d7-c569-431e-bb45-cb41f66caa3d: TimeElapsed:747689ms DataProcessed:1MiB DataRemaining:0MiB DataTotal:1MiB MemoryProcessed:0MiB MemoryRemaining:0MiB MemoryTotal:0MiB MemoryBandwidth:0Mbps DirtyRate:0Mbps Iteration:0 PostcopyRequests:0 ConstantPages:0 NormalPages:0 NormalData:0MiB ExpectedDowntime:0ms DiskMbps:0","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:685","timestamp":"2025-06-09T20:04:05.649309Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}

7. Times out on the virt-launcher side (but wasn't it cancelled?):
{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration abort detected with reason: Live migration is not completed after 750 seconds and has been aborted","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:649","timestamp":"2025-06-09T20:04:09.270034Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}

... 2 days later ...

8. Trying again reveals that the libvirt job was not cancelled by the "abort" above:
{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"fio-vm-multi-disk","namespace":"kubevirt-0-1749496987","pos":"live-migration-source.go:1016","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=55, Domain=10, Message='Requested operation is not valid: another migration job is already running for domain 'kubevirt-0-1749496987_fio-vm-multi-disk'')","timestamp":"2025-06-11T12:12:43.508933Z","uid":"332aa27b-b3f3-4f5a-93bd-0f12fef37995"}
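The two log signatures above (the bogus 750-second abort on the source, and libvirt's VIR_ERR_OPERATION_INVALID "another migration job is already running" on retry) can be used to scan virt-launcher logs for other VMs hit by this bug. The helper below is a sketch, not part of KubeVirt; the patterns are copied from the excerpts in this report.

```shell
#!/bin/bash
# Illustrative detector for this bug's log signatures. Returns 0 (true)
# when a log line matches either the stuck-job retry error (virError
# Code=55) or the spurious migration timeout from virt-launcher.
is_stuck_migration_log() {
  grep -Eq "another migration job is already running|Live migration is not completed after [0-9]+ seconds" <<< "$1"
}

# Example usage against a pod's logs (pod name is a placeholder):
#   kubectl logs -n kubevirt-0-1749496987 <virt-launcher-pod> -c compute \
#     | while IFS= read -r line; do
#         is_stuck_migration_log "$line" && echo "suspect: $line"
#       done
```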
depends on: RHEL-97807 Lost track of migration when cancelled during pre-copy "switchover" phase (Status: New)