-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
Quality / Stability / Reliability
-
False
-
-
False
-
CLOSED
-
CNV Virtualization Sprint 220, CNV Virtualization Sprint 221
-
Important
-
No
Description of problem:
During an OCP upgrade, a Windows VMI repeatedly tried to live-migrate, failing in a loop with two different errors:
1. On the first attempt, the migration is aborted after being stuck for more than X seconds.
2. On the second attempt, the migration is aborted due to an apparent network error, reported in the source virt-launcher (lost connection to destination host).
These two errors kept recurring in exactly this order until the VMI eventually managed to migrate.
While the first error is a condition that can legitimately occur when bandwidth is saturated or the VM's dirty rate is too high, the second one is a bug. It originates in the source virt-handler, which detects a "migration job already executed" condition and tears down all the migration proxies, making the destination host unreachable.
Version-Release number of selected component (if applicable):
CNV 4.8.1
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
The bug could be similar to the one fixed in https://github.com/kubevirt/kubevirt/pull/7582, which was caused by detecting a migration takeover from another migration object without waiting for the Informer cache to be up to date. Something similar might be happening here.
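The suspected race can be illustrated with a small decision function. This is a sketch under stated assumptions, not the actual virt-handler logic or the change made in the linked PR: `cachedMigrationState`, `decide` and the `cacheCaughtUp` flag are hypothetical, and `cacheCaughtUp` stands in for whatever freshness check the handler has available. The idea is simply that a destructive action such as tearing down the migration proxies should be deferred while the cached view may still be stale.

```go
package main

// decision is the outcome of evaluating the cached migration state.
type decision string

const (
	proceed  decision = "proceed"  // keep the proxies, carry on migrating
	requeue  decision = "requeue"  // cache may be stale, look again later
	teardown decision = "teardown" // migration really is over or replaced
)

// cachedMigrationState is a hypothetical, simplified view of what the
// informer cache says about the VMI's migration.
type cachedMigrationState struct {
	MigrationUID string
	Completed    bool
}

// decide compares the cached state with the migration the handler is
// currently working on (activeUID). cacheCaughtUp is an assumed signal
// that the cache already reflects the latest writes.
func decide(cached cachedMigrationState, activeUID string, cacheCaughtUp bool) decision {
	if cached.MigrationUID == activeUID && !cached.Completed {
		return proceed
	}
	if !cacheCaughtUp {
		// The mismatch may only be a stale cache entry from an earlier
		// attempt, so avoid tearing anything down yet.
		return requeue
	}
	return teardown
}
```

Without the `cacheCaughtUp` guard, a stale entry describing the previous, already finished attempt would lead straight to `teardown`, which matches the symptom of the destination becoming unreachable in the middle of the second attempt.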