-
Epic
-
Resolution: Unresolved
-
Normal
-
postcopy-cutover
-
Incidents & Support
-
To Do
Currently the migration logic waits for CompletionTimeoutPerGiB to expire and only then triggers post-copy.
While CompletionTimeoutPerGiB is customizable, this is not ideal because:
- A simple timeout based on VM size does not take transfer speeds or rates into account. It deterministically triggers post-copy at a fixed time derived from the VM's RAM size, regardless of whether the network is fast, whether the migration is stalled, or whether pre-copy is progressing well.
- So it can be too short (not enough pre-copy time), leaving too much memory to be page-faulted over the network.
- Or it can be too long (too much time wasted in the pre-copy phase with no progress being made). With the default value of CompletionTimeoutPerGiB, this is what happens.
See this:
Migration start 2025-09-24T13:41
{"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 6b4a1b90-9d8e-4e58-b32d-b1ba1d559cfc: TimeElapsed:3990ms DataProcessed:1178MiB DataRemaining:49767MiB DataTotal:51220MiB MemoryProcessed:1178MiB MemoryRemaining:49767MiB MemoryTotal:51220MiB MemoryBandwidth:2963Mbps DirtyRate:0Mbps Iteration:1 PostcopyRequests:0 ConstantPages:70889 NormalPages:300957 NormalData:1175MiB ExpectedDowntime:300ms DiskMbps:0","name":"ambw399x","namespace":"removed","pos":"live-migration-source.go:710","timestamp":"2025-09-24T13:41:27.168197Z","uid":"ecf655a2-219c-420e-b7e3-3d4f86bc4af9"}
Times out to trigger post-copy 1h30m later, at 2025-09-24T15:11:
{"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 6b4a1b90-9d8e-4e58-b32d-b1ba1d559cfc: TimeElapsed:5399135ms DataProcessed:2944642MiB DataRemaining:1932MiB DataTotal:51220MiB MemoryProcessed:2944642MiB MemoryRemaining:1932MiB MemoryTotal:51220MiB MemoryBandwidth:6855Mbps DirtyRate:4120Mbps Iteration:528 PostcopyRequests:0 ConstantPages:36811578 NormalPages:752277821 NormalData:2938585MiB ExpectedDowntime:4482ms DiskMbps:0","name":"ambw399x","namespace":"removed","pos":"live-migration-source.go:710","timestamp":"2025-09-24T15:11:22.312570Z","uid":"ecf655a2-219c-420e-b7e3-3d4f86bc4af9"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Starting post copy mode for migration","name":"ambw399x","namespace":"removed","pos":"live-migration-source.go:542","timestamp":"2025-09-24T15:11:24.324884Z","uid":"ecf655a2-219c-420e-b7e3-3d4f86bc4af9"}
If this happens for every VM on every node, consider an OCP upgrade where all nodes are upgraded: the upgrade time blows out to days. Add CNV upgrades on top, for example a customer going from 4.16 to 4.20, and it will take weeks. So we need to trigger post-copy earlier, and we need a better method to detect when.
The key here is that there was no progress after this point: just 144MiB remaining at 13:43, which is about 2m into the migration (TimeElapsed:120626ms).
{"component":"virt-launcher","kind":"","level":"info","msg":"Migration info for 6b4a1b90-9d8e-4e58-b32d-b1ba1d559cfc: TimeElapsed:120626ms DataProcessed:58481MiB DataRemaining:144MiB DataTotal:51220MiB MemoryProcessed:58481MiB MemoryRemaining:144MiB MemoryTotal:51220MiB MemoryBandwidth:2602Mbps DirtyRate:1078Mbps Iteration:2 PostcopyRequests:0 ConstantPages:1244241 NormalPages:14939379 NormalData:58356MiB ExpectedDowntime:39728ms DiskMbps:0","name":"ambw399x","namespace":"removed","pos":"live-migration-source.go:710","timestamp":"2025-09-24T13:43:23.803808Z","uid":"ecf655a2-219c-420e-b7e3-3d4f86bc4af9"}
From here, MemoryRemaining does not go any lower. The logic should start a timer at this point and wait about 1m. If there is no real progress in that time (MemoryRemaining has not dropped below 144MiB), trigger post-copy.
So it should have triggered post-copy at around 13:44, not at 15:11. This would save a lot of time and network resources on node drains.
This stall detection is what RHV does as well; it has been used and tested for a long time.
Simply put, when pre-copy stops being effective -> trigger post-copy.
In the original CNV-71164 there was some confusion that ProgressTimeout was supposed to do this, but jelejosne clarified that's supposed to be a network timeout. So this is now an RFE.
ProgressTimeout also needs to be fixed; see CNV-72386.
- clones CNV-72387 Need logic to trigger post-copy at more ideal time. (New)