  1. OpenShift Virtualization
  2. CNV-17988

[2082008] LiveMigration fails due to loss of connection to destination host


    • CNV Virtualization Sprint 220, CNV Virtualization Sprint 221
    • Important
    • No

      Description of problem:
      During an OCP upgrade, a Windows VMI repeatedly tried to live-migrate, failing in a loop with 2 different errors:

      1. On the first attempt, the migration is aborted after being stuck for more than X seconds.

      2. On the second attempt, the migration aborts due to an apparent network error, reported in the source virt-launcher (lost connection to destination host).

      These 2 errors recurred in exactly this order until the VMI eventually managed to migrate.

      The 1st error is a condition that can legitimately occur when bandwidth is saturated or the dirty rate of the VM is too high. The 2nd one is a bug that originates in the source virt-handler: it detects a "migration job already executed" condition and tears down all the migration proxies, which makes the destination host unreachable.

      Version-Release number of selected component (if applicable):
      CNV 4.8.1

      How reproducible:

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Expected results:

      Additional info:

      The bug could be similar to the one fixed in https://github.com/kubevirt/kubevirt/pull/7582, which was caused by detecting a migration takeover by another migration object without waiting for the Informer cache to be up to date. Something similar might be happening here.
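
      The suspected failure mode can be sketched as a guard around the teardown decision. This is a simplified Go model under stated assumptions, not the actual virt-handler logic and not the change made in the linked PR: `decideProxyAction`, its parameters, and the UID strings are hypothetical. The idea is that a migration UID mismatch read from a cache that may be stale should lead to a re-check rather than to tearing down the proxies. In a real controller built on client-go, the "cache is up to date" signal would typically come from the informer (for example `cache.WaitForCacheSync`).

```go
package main

import "fmt"

type decision string

const (
	keepProxies     decision = "keep proxies"
	requeue         decision = "requeue and re-check"
	tearDownProxies decision = "tear down proxies"
)

// decideProxyAction is a hypothetical guard, not KubeVirt code.
//
//	servingUID:  migration the handler set its proxies up for
//	observedUID: migration UID read from the locally cached VMI
//	cacheSynced: whether the local cache is known to be up to date
func decideProxyAction(servingUID, observedUID string, cacheSynced bool) decision {
	if observedUID == servingUID {
		// Still the same migration: nothing to tear down.
		return keepProxies
	}
	if !cacheSynced {
		// A mismatch read from a stale cache may just be an old object,
		// so it is not yet evidence of a takeover.
		return requeue
	}
	// The cache is current and reports a different migration.
	return tearDownProxies
}

func main() {
	fmt.Println(decideProxyAction("mig-2", "mig-2", true))
	fmt.Println(decideProxyAction("mig-2", "mig-1", false))
	fmt.Println(decideProxyAction("mig-2", "mig-3", true))
}
```

      Without the `cacheSynced` check, the second call would tear down the proxies of a migration that is still running, which would match the "lost connection to destination host" symptom described above.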

              acardace@redhat.com Antonio Cardace
              acardace@redhat.com Antonio Cardace
              Vasiliy Sibirskiy Vasiliy Sibirskiy
