  1. RHEL
  2. RHEL-38485

Failure to resume paused post-copy migration is undetectable

    • qemu-kvm-9.1.0-1.el9
    • None
    • ZStream
    • rhel-sst-virtualization
    • ssg_virtualization
    • 8
    • False
    • None
    • None
    • Approved Blocker
    • x86_64
    • None

      What were you trying to do that didn't work?

      When migration is in the "postcopy-paused" state, resuming it with "migrate
      resume=true" immediately reports success (which is fine, the QMP call is
      asynchronous), but libvirt (or anyone else using QMP) has no way to detect
      that the resume failed. If the attempt to resume post-copy migration
      succeeds, migration events report the state changes (postcopy-recover,
      postcopy-active), and if migration fails again later, postcopy-paused is
      reported again. But if the attempt itself fails (e.g., the connection to
      the destination fails), no state change or event is reported and migration
      just stays in postcopy-paused. The only visible sign, if we're lucky, is a
      changed "error-desc" field in the query-migrate response, and that only
      works when the new error differs from the one that originally caused
      migration to be paused. The error is also printed on stderr, but that is
      not usable either. Thus it is impossible to tell whether the resume has
      not started yet or has failed again.
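
To illustrate why "error-desc" is not a reliable signal, here is a minimal Python sketch of the only heuristic a QMP client has today. It compares two query-migrate replies, one taken before and one after "migrate resume=true". The function name and the returned labels are hypothetical; only the "status" and "error-desc" fields and the state names come from the description above.

```python
from typing import Optional


def guess_resume_outcome(before: dict, after: dict) -> Optional[str]:
    """Best-effort guess from two query-migrate replies.

    `before` is the reply taken before "migrate resume=true",
    `after` is a reply taken some time later.
    Returns "resumed", "failed", or None when nothing can be concluded.
    """
    status = after.get("status")
    if status in ("postcopy-recover", "postcopy-active"):
        return "resumed"
    if status == "postcopy-paused":
        # The only hint of a failed resume is a changed error text.
        if after.get("error-desc") != before.get("error-desc"):
            return "failed"
        # Same error text: the resume may not have started yet, or it
        # may have failed with the same error. We cannot tell.
        return None
    return None
```

When the second failure produces the same error text as the first, the function returns None, which is exactly the ambiguity this report is about.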

      Please provide the package NVR for which bug is seen:

      qemu-kvm-8.2.0-11.el9_4

      How reproducible:

      100%

      Steps to reproduce

      1. start post-copy migration
      2. once migration is in postcopy-active state, call "migrate-pause"
      3. block incoming connections on the migration ports on the destination host (firewall-cmd --zone=public --remove-port=49152-49215/tcp)
      4. call "migrate" command with resume=true
      5. the call returns success
      6. check stderr for the connection error to be reported
      7. no MIGRATION event has been emitted since "migrate" was called

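The QMP side of the steps above (pausing, resuming, and then polling) can be sketched as the JSON messages a client would send. This is only an illustration: the helper names are hypothetical, the URI is a placeholder that must match the actual destination, and transport details (QMP socket, capabilities negotiation, starting the post-copy migration itself) are left out.

```python
import json


def qmp_command(name: str, **arguments) -> str:
    """Serialize one QMP command as a JSON string."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message)


def reproduction_commands(uri: str) -> list:
    """QMP commands for steps 2 and 4, plus a follow-up status query.

    `uri` is a placeholder for the migration URI of the destination.
    """
    return [
        qmp_command("migrate-pause"),                  # step 2
        qmp_command("migrate", uri=uri, resume=True),  # step 4
        qmp_command("query-migrate"),                  # inspect status / error-desc
    ]


if __name__ == "__main__":
    for line in reproduction_commands("tcp:destination.example:49152"):
        print(line)
```
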
      Actual results

      No events if resume fails again.

      Expected results

      An event in both the success and the failure scenario, so that we know
      whether migration is running or has failed again.

      An ideal solution for libvirt would be introducing a new migration state
      (e.g., postcopy-recover-setup or something similar) which would be entered and
      reported by a MIGRATION event before "migrate" QMP command returns. On success
      the state would normally change to postcopy-recover and later to
      postcopy-active. But in case the resume attempt fails before entering
      postcopy-recover, the state would change back to postcopy-paused and the
      corresponding MIGRATION event would be emitted.

      This way we could easily detect that we're talking to a fixed QEMU, as an
      old QEMU would not report the new state in a MIGRATION event before the
      "migrate" QMP command returns. And we could reliably wait for either
      failure or success.
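
A minimal Python sketch of how a client could use the proposed behavior follows. It assumes the new state is called "postcopy-recover-setup" (the name suggested above, not a confirmed one) and that the client has collected the statuses from MIGRATION events received after issuing "migrate resume=true". The function name and result labels are hypothetical.

```python
def classify_resume(event_statuses):
    """Classify a resume attempt from the MIGRATION event statuses seen so far.

    Returns one of:
      "resume-started": postcopy-recover or postcopy-active was reported
      "failed":         the state went from setup back to postcopy-paused
      "pending":        setup was reported, no outcome yet
      "unknown":        no setup event seen (old QEMU, or nothing reported yet)
    """
    seen_setup = False
    for status in event_statuses:
        if status == "postcopy-recover-setup":  # assumed name of the new state
            seen_setup = True
        elif status in ("postcopy-recover", "postcopy-active"):
            return "resume-started"
        elif status == "postcopy-paused" and seen_setup:
            return "failed"
    return "pending" if seen_setup else "unknown"
```

With a fixed QEMU the setup event would arrive before "migrate" returns, so a result of "unknown" right after the command returns would indicate an old QEMU, and the client could fall back to its previous behavior.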

              zhexu@redhat.com Peter Xu
              jdenemar@redhat.com Jiri Denemark
              virt-maint virt-maint
              Xiaohui Li Xiaohui Li