-
Story
-
Resolution: Unresolved
-
Normal
-
rhel-9.4
-
qemu-kvm-9.1.0-1.el9
-
ZStream
-
rhel-sst-virtualization
-
ssg_virtualization
-
8
-
False
-
-
None
-
None
-
Approved Blocker
-
Pass
-
Manual
-
-
x86_64
-
None
What were you trying to do that didn't work?
When migration is in the "postcopy-paused" state, resuming it with
"migrate resume=true" immediately reports success. That is fine, since the
QMP call is asynchronous, but there is no way for libvirt (or anyone else
using QMP) to detect that the resume attempt failed. If the attempt to resume
post-copy migration succeeds, migration events report the state changes
(postcopy-recover, postcopy-active), and if migration fails again later,
postcopy-paused is reported again. But if the attempt itself fails (e.g., the
connection to the destination fails), no state change or event is reported
and migration just stays in postcopy-paused. The only visible sign, and only
sometimes, is a changed "error-desc" field in the query-migrate response,
which works only when the new error differs from the one that caused
migration to be paused originally. The error is also printed on stderr, but
that is not usable either. Thus it is impossible to tell whether the resume
has not started yet or has failed again.
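To illustrate why the "error-desc" comparison is unreliable, here is a minimal Python sketch of the only heuristic a QMP client has today. The function name and return values are hypothetical; the inputs are assumed to be the "return" dictionaries of two query-migrate calls, one taken before and one after the resume attempt.

```python
from typing import Optional


def resume_outcome_from_query(before: dict, after: dict) -> Optional[str]:
    """Guess the outcome of a resume attempt from two query-migrate replies.

    Hypothetical helper: returns "progressed" if the state left
    postcopy-paused, "failed" if error-desc changed, and None when the
    two replies are indistinguishable.
    """
    if after.get("status") != "postcopy-paused":
        return "progressed"
    if after.get("error-desc") != before.get("error-desc"):
        return "failed"
    # Same state and same error text: either the resume has not started
    # yet, or it failed again with an identical error. No way to tell.
    return None
```

The last branch is the problem described above: when the second failure produces the same error text as the first, the client gets no usable signal.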
Please provide the package NVR for which bug is seen:
qemu-kvm-8.2.0-11.el9_4
How reproducible:
100%
Steps to reproduce
- start post-copy migration
- once migration is in postcopy-active state, call "migrate-pause"
- block incoming connections to the migration ports on the destination host (firewall-cmd --zone=public --remove-port=49152-49215/tcp)
- call "migrate" command with resume=true
- the call returns success
- check stderr for the connection error to be reported
- no MIGRATION event has been emitted since "migrate" was called
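The source-side QMP part of the steps above can be sketched as a sequence of JSON commands. This is only an illustration, assuming a TCP migration URI: the destination address is a placeholder, the helper name is hypothetical, and the firewall change on the destination has to be made by hand between "migrate-pause" and the resume call.

```python
import json

# Placeholder destination; not taken from the report.
DEST_URI = "tcp:dst.example.com:49152"


def reproducer_commands(uri: str = DEST_URI) -> list:
    """Return the source-side QMP commands for the reproducer, in order."""
    return [
        # Post-copy must be enabled before migration starts.
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "postcopy-ram", "state": True}]}},
        {"execute": "migrate", "arguments": {"uri": uri}},
        {"execute": "migrate-start-postcopy"},
        # Once the state is postcopy-active, pause the migration.
        {"execute": "migrate-pause"},
        # (Block the migration ports on the destination here.)
        {"execute": "migrate", "arguments": {"uri": uri, "resume": True}},
        # The state is still postcopy-paused and no event was emitted.
        {"execute": "query-migrate"},
    ]


if __name__ == "__main__":
    for command in reproducer_commands():
        print(json.dumps(command))
```

Each printed line can be sent over a QMP monitor socket after capability negotiation; waiting for the postcopy-active event before sending "migrate-pause" is left to the caller.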
Actual results
No events if resume fails again.
Expected results
An event in both the success and failure scenarios, so that we know whether
migration is running or has failed again.
An ideal solution for libvirt would be introducing a new migration state
(e.g., postcopy-recover-setup or something similar) which would be entered and
reported by a MIGRATION event before "migrate" QMP command returns. On success
the state would normally change to postcopy-recover and later to
postcopy-active. But in case the resume attempt fails before entering
postcopy-recover, the state would change back to postcopy-paused and the
corresponding MIGRATION event would be emitted.
This way we could easily detect that we're talking to a fixed QEMU, since an
old QEMU would not report the new state in a MIGRATION event before the
"migrate" QMP command returns. And we could reliably wait for either failure
or success.
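A minimal Python sketch of how a client could consume the proposed events follows. It assumes the new state is called "postcopy-recover-setup", which is just the name suggested above; the function name and the returned labels are hypothetical. The input is the ordered list of "status" values from MIGRATION events seen since the resume command was sent.

```python
from typing import Iterable

# State name as suggested in this report; the final name may differ.
RECOVER_SETUP = "postcopy-recover-setup"


def classify_resume(statuses: Iterable[str]) -> str:
    """Classify a resume attempt from the MIGRATION event statuses so far."""
    seen_setup = False
    for status in statuses:
        if status == RECOVER_SETUP:
            seen_setup = True
        elif status == "postcopy-active":
            return "recovered"
        elif status == "postcopy-paused":
            return "failed"
        # postcopy-recover: recovery is in progress, keep reading events.
    # With the new state seen, the attempt is known to be under way.
    # Without it (old QEMU, no events), nothing can be concluded.
    return "pending" if seen_setup else "unknown"
```

With a fixed QEMU the "unknown" result would never be returned once "migrate" has completed, so the caller can simply keep waiting for events while the result is "pending".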
- blocks: RHEL-22166 Recover postcopy returned with error '"job 'migration in' failed in post-copy phase' but the migration was recovered successfully in fact
- Release Pending
- links to: RHBA-2024:139949 qemu-kvm update