- Bug
- Resolution: Done-Errata
- Normal
- None
- Quality / Stability / Reliability
- False
- False
- CLOSED
- Moderate
- No
Description of problem:
The migration progress timeout exists to ensure that migration data keeps being transferred from source to target.
If no activity happens for the defined amount of time (2.5 minutes by default), the migration is cancelled.
However, the current implementation expects the remaining data counter to make absolute progress within that time. By "absolute progress", I mean dropping below the lowest value it has ever reached. If the remaining data goes up, which can happen for various reasons (for example, the guest dirtying memory faster than it can be transferred), then subsequent decreases do not count until the value falls back below that all-time low.
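To make the problem concrete, here is a minimal Python sketch of the "absolute progress" rule as described above. It is not the actual implementation: the class name, the `poll` method, the 30-second poll interval, the sample values and the strict `>` comparison are all assumptions made for illustration.

```python
from typing import Optional

PROGRESS_TIMEOUT_S = 150.0  # 2.5 minutes, the default mentioned in this report


class AbsoluteProgressMonitor:
    """Hypothetical model: the timer resets only on a new all-time low."""

    def __init__(self, timeout_s: float = PROGRESS_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.lowest_remaining: Optional[int] = None
        self.last_progress_at: float = 0.0

    def poll(self, now_s: float, remaining: int) -> bool:
        """Record one poll; return True if the migration would be cancelled."""
        if self.lowest_remaining is None or remaining < self.lowest_remaining:
            # Only a value below every earlier value counts as progress.
            self.lowest_remaining = remaining
            self.last_progress_at = now_s
        return now_s - self.last_progress_at > self.timeout_s


# Assumed samples: (seconds since start, remaining data in arbitrary units).
# The counter jumps up at t=60, then decreases on every later poll.
samples = [(0, 1000), (30, 800), (60, 1500), (90, 1400), (120, 1300),
           (150, 1200), (180, 1100), (210, 1000), (240, 900)]

monitor = AbsoluteProgressMonitor()
cancelled_at = next((t for t, r in samples if monitor.poll(t, r)), None)
print(cancelled_at)  # 210
```

In this made-up trace the counter decreases on every poll after t=60, yet the migration is cancelled at t=210 because it never gets back under the low of 800 reached at t=30.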
This is unreasonable in many scenarios. The worst case is a very active VM with lots of RAM on a slow network.
Instead, we should expect relative progress, resetting the timer every time the remaining data goes down from one poll to the next. That will effectively ensure data is flowing, without worrying about eventual convergence, which is ensured by other mechanisms.
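The proposed "relative progress" rule could look roughly like the Python sketch below. This is only an illustration of the idea, not a patch: the class name, the `poll` method, the poll interval, the sample values and the strict `>` comparison are hypothetical.

```python
from typing import Optional

PROGRESS_TIMEOUT_S = 150.0  # 2.5 minutes, the default mentioned in this report


class RelativeProgressMonitor:
    """Hypothetical model: the timer resets whenever a poll is lower than the previous one."""

    def __init__(self, timeout_s: float = PROGRESS_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.previous_remaining: Optional[int] = None
        self.last_progress_at: float = 0.0

    def poll(self, now_s: float, remaining: int) -> bool:
        """Record one poll; return True if the migration would be cancelled."""
        if self.previous_remaining is None or remaining < self.previous_remaining:
            # Any decrease since the previous poll shows that data is flowing.
            self.last_progress_at = now_s
        self.previous_remaining = remaining
        return now_s - self.last_progress_at > self.timeout_s


def first_cancel(samples):
    """Return the time of the first cancellation, or None if there is none."""
    monitor = RelativeProgressMonitor()
    return next((t for t, r in samples if monitor.poll(t, r)), None)


# Assumed samples: the counter jumps up at t=60, then keeps decreasing.
busy_vm = [(0, 1000), (30, 800), (60, 1500), (90, 1400), (120, 1300),
           (150, 1200), (180, 1100), (210, 1000), (240, 900)]
# Assumed samples: the counter never moves, so the migration is truly stuck.
stalled = [(t, 1000) for t in range(0, 241, 30)]

print(first_cancel(busy_vm))  # None
print(first_cancel(stalled))  # 180
```

With this rule the busy VM trace is never cancelled, while a migration whose counter stops decreasing altogether is still cancelled once the timeout elapses. Whether the migration eventually converges is deliberately left to the other mechanisms mentioned above.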