Migration Toolkit for Virtualization / MTV-4436

Warm Migration delayed by ~8 hours at "WaitForPenultimateSnapshotRemoval" due to vSphere snapshot consolidation


    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • 2.11.0
    • Controller
    • Quality / Stability / Reliability
    • False
    • True
    • Proposed

      During the execution of TC 2.5 (Warm Migration) on NetApp, a significant performance degradation was observed during the cutover phase. The migration flow was effectively "stuck" for approximately 8 hours after the cutover was requested, waiting for a vSphere snapshot removal task to complete before proceeding to the actual cutover.

      Test setup: a single 1TB VM with 360GB of baseline data. An fio script runs overnight to generate data automatically. Crucially, data generation starts only after the first snapshot is taken. This ensures the first snapshot captures the 360GB baseline, while subsequent hourly snapshots capture the active changes. The cutover is performed in the morning.

      This delay caused the total migration time to exceed 17 hours, causing automation timeouts and presenting a risk for customer use cases involving large data volumes.

      Steps to Reproduce / Timeline

      1. Start Plan: Warm migration initiated (Jan 20, 09:16 UTC).
      2. Snapshot Phase: 4 snapshots created and transferred successfully.
      3. Cutover Request: Triggered at Jan 20, 18:26 UTC.
      4. Issue Occurred: The system entered a wait state for WaitForPenultimateSnapshotRemoval.
        • Task ID: haTask-3170-vim.vm.Snapshot.remove-1921576353
        • Duration: The system waited ~8 hours for vSphere to consolidate/delete the previous snapshot (3170-snapshot-9).
      5. Cutover Start: Migration only resumed at Jan 21, 02:13 UTC (immediately after snapshot removal was confirmed).
      6. Completion: Plan finished at Jan 21, 02:29 UTC.
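
      The intervals implied by the timeline above can be checked with a few lines of Python. This is only an arithmetic sketch: the timestamps are copied from the steps above, the year 2026 is taken from the log excerpt further down, and the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # all timestamps are UTC, minute precision

# Timestamps copied from the timeline above.
plan_started = datetime.strptime("2026-01-20 09:16", FMT)
cutover_requested = datetime.strptime("2026-01-20 18:26", FMT)
cutover_started = datetime.strptime("2026-01-21 02:13", FMT)
plan_finished = datetime.strptime("2026-01-21 02:29", FMT)

# Time spent waiting between the cutover request and the actual cutover.
blocked = cutover_started - cutover_requested
# End-to-end duration of the plan.
total = plan_finished - plan_started

print("blocked after cutover request:", blocked)
print("total plan duration:", total)
```

      Measured from the cutover request, the blocked interval is a little under 8 hours, which matches the "~8 hours" figure in the summary.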

      Observed Behavior

      The MTV controller blocks the migration workflow synchronously while waiting for the WaitForPenultimateSnapshotRemoval step to complete. On large VMs/Datastores, vSphere snapshot consolidation can take hours, causing the migration to appear hung and delaying the final cutover significantly.

      Expected Behavior

      The migration workflow should ideally not be blocked for extended periods by intermediate snapshot consolidation, or an optimization should be applied to handle this process asynchronously to ensure a predictable cutover window.
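
      One way to picture a "predictable cutover window" is a bounded, non-blocking wait on the removal task. The sketch below is purely illustrative and is not MTV controller code: `FakeRemovalTask`, `wait_for_removal`, and the state strings are hypothetical names, and the task is simulated rather than queried from vSphere.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeRemovalTask:
    """Hypothetical stand-in for a vSphere snapshot-removal task."""
    polls_until_done: int
    polls: int = 0

    async def state(self) -> str:
        # Report "running" until the configured number of polls is reached.
        self.polls += 1
        return "success" if self.polls >= self.polls_until_done else "running"


async def wait_for_removal(task, poll_interval: float, max_wait: float) -> str:
    """Poll the task without blocking the event loop; give up after max_wait seconds."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait
    while True:
        state = await task.state()
        if state != "running":
            return state
        if loop.time() >= deadline:
            # The caller decides what to do next (warn, retry, proceed, fail).
            return "timed_out"
        await asyncio.sleep(poll_interval)


if __name__ == "__main__":
    print(asyncio.run(wait_for_removal(FakeRemovalTask(3), 0.01, 1.0)))
```

      The point of the sketch is only that the wait has an explicit upper bound and yields control while polling. What the workflow should do when the bound is hit is the open design question of this issue.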

      Logs & Evidence

      • Snapshot Removal Task: haTask-3170-vim.vm.Snapshot.remove-1921576353
      • Stuck Interval: Jan 20 18:00 UTC – Jan 21 02:17 UTC
      • Total Duration: ~17 hours 14 minutes

      Investigation / Proposed Fix (From Email Discussion)

      • Root Cause: The delay is caused by the vSphere storage backend performing a heavy "Consolidate Helper" operation on the snapshot chain.
      • Developer Notes: We currently wait for WaitForPenultimateSnapshotRemoval.
      • Manager/Team Suggestion for investigation:
        • Investigate if we can disable consolidation for the removal of final snapshots to speed up the process.
        • Investigate if consolidation can be performed asynchronously (background process) so the migration workflow does not wait on it.
        • Note: Need to verify if disabling consolidation impacts source VM performance or causes vSphere to refuse creating subsequent snapshots.
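
      For the first suggestion, the vSphere API call behind snapshot removal, RemoveSnapshot_Task, accepts an optional consolidate flag alongside removeChildren. A minimal sketch of passing that flag through pyVmomi-style bindings follows. It is an assumption-laden illustration, not the MTV implementation: the helper name and `StubSnapshot` are hypothetical, and the stub exists only so the snippet runs without a vCenter connection. Whether skipping consolidation is safe is exactly what the note above says still needs to be verified.

```python
def remove_snapshot_without_consolidation(snapshot):
    """Request removal of one snapshot and ask vSphere to skip disk consolidation.

    `snapshot` is assumed to behave like a pyVmomi vim.vm.Snapshot object.
    The task is returned immediately; this helper does not wait for it.
    """
    return snapshot.RemoveSnapshot_Task(removeChildren=False, consolidate=False)


class StubSnapshot:
    """Hypothetical stand-in that records the arguments it was called with."""

    def __init__(self):
        self.calls = []

    def RemoveSnapshot_Task(self, removeChildren, consolidate=None):
        self.calls.append((removeChildren, consolidate))
        return "stub-task"


stub = StubSnapshot()
task = remove_snapshot_without_consolidation(stub)
print(task, stub.calls)
```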

      Full logs can be found on the cloud15 provisioner host under:

      /home/kni/Tzahi_MTV/mtv-debug-1vm-1tb-366gb-usage-dynamic-fio-warm-tc2-5-20260122-110322 

       

      ============================================================
       Warm Migration Details
      ============================================================
      [INFO] VM: fio-1tb-warm
        ----------------------------------------
        Precopy Summary:
          Total precopies: 5
          Successes: 4
          Failures: 0
        Snapshot/Precopy Breakdown:
          Precopy #1: 3170-snapshot-6
            Started:  2026-01-20T09:16:13Z
            Ended:    2026-01-20T11:34:22Z
            Duration: 02:18:09
            Deltas transferred: 1
          Precopy #2: 3170-snapshot-7
            Started:  2026-01-20T13:24:24Z
            Ended:    2026-01-20T14:05:54Z
            Duration: 00:41:30
            Deltas transferred: 1
          Precopy #3: 3170-snapshot-8
            Started:  2026-01-20T15:33:13Z
            Ended:    2026-01-20T15:56:14Z
            Duration: 00:23:01
            Deltas transferred: 1
          Precopy #4: 3170-snapshot-9
            Started:  2026-01-20T17:19:38Z
            Ended:    2026-01-20T17:40:00Z
            Duration: 00:20:22
            Deltas transferred: 1
          Precopy #5: 3170-snapshot-10
            Started:  2026-01-21T02:13:26Z
        Cutover Details:
          Phase: Completed
          Started:   2026-01-21T02:13:32Z
          Completed: 2026-01-21T02:26:31Z
          Duration:  00:12:59
        vSphere Snapshot Removal/Consolidation:
          Last completed precopy: 3170-snapshot-9
          Precopy ended:          2026-01-20T17:40:00Z
          Cutover started:        2026-01-21T02:13:32Z
          Wait time:              08:33:32
      [WARN]   Long snapshot removal time detected (08:33:32)
      [INFO]   This delay is vSphere-side snapshot consolidation
          Disk Cutover Tasks:
            - 1tb-fio-warm_3.vmdk...
              Status: Completed | Precopies: 5 | Data: 1048576/1048576 MB
        Disk Transfer (Incremental):
          Started:   2026-01-20T09:18:27Z
          Completed: 2026-01-20T17:40:00Z
          Duration:  08:21:33
          Data transferred: 1048576/1048576 MB
          Transfer rate: ~34 MB/s
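
      The derived figures in the report above can be reproduced from its own timestamps. The sketch below recomputes the wait time and the transfer rate. All values are copied from the log, and the helper name `hms` is illustrative. Note that the report measures the wait from the end of the last completed precopy (3170-snapshot-9), not from the cutover request, which is why it is longer than the ~8 hours quoted in the summary.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%SZ"


def hms(delta: timedelta) -> str:
    """Format a timedelta as HH:MM:SS, matching the report's style."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Values copied from the report above.
precopy_ended = datetime.strptime("2026-01-20T17:40:00Z", FMT)
cutover_started = datetime.strptime("2026-01-21T02:13:32Z", FMT)
transfer_started = datetime.strptime("2026-01-20T09:18:27Z", FMT)
transfer_completed = datetime.strptime("2026-01-20T17:40:00Z", FMT)
data_mb = 1048576

wait = cutover_started - precopy_ended
transfer = transfer_completed - transfer_started
rate_mb_s = data_mb / transfer.total_seconds()

print("wait time:", hms(wait))
print("transfer duration:", hms(transfer))
print(f"transfer rate: {rate_mb_s:.1f} MB/s")
```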
       

              marnold@redhat.com Matthew Arnold
              tzahia Tzahi Ashkenazi