Migration Toolkit for Virtualization / MTV-1656

Incorrect ordering of snapshots leads to XFS filesystem corruption after warm migration of a VM from VMware


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Fix Version: 2.7.4
    • Component: Controller
    • Severity: Critical

      Description of problem:

      The destination VM shows XFS filesystem corruption after migrating a VMware VM to OpenShift Virtualization using MTV.

      Followed the test steps below:

      [1] The VM has 2 disks; the 2nd disk is mounted on /test.

      Started fio with target /test on the VMware VM:

      fio --rw=randrw --ioengine=libaio  --directory=/test/ --size=10G --bs=4k --name=vm_perf --runtime=12000 --time_based  --iodepth=512  --group_reporting

      The runtime is kept high so that it always runs during the migration.

      [2] Reduced the “controller_precopy_interval” to 10 minutes for faster precopy.
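
      For reference, a minimal sketch of how this can be done, assuming the default ForkliftController instance name (forklift-controller) and namespace (openshift-mtv); adjust both if the operator is installed differently:

      # Assumed instance name/namespace; the value is the precopy interval in minutes.
      oc patch forkliftcontroller/forklift-controller -n openshift-mtv \
        --type merge -p '{"spec": {"controller_precopy_interval": 10}}'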

      [3] Started a warm migration.

      After the initial full download of the virtual disks, while the DVs were in the paused state, I manually mapped and mounted the rbd device from the worker node, which was successful:

      [root@openshift-worker-deneb-0 ~]# mount /dev/rbd3 /mnt/
      [root@openshift-worker-deneb-0 ~]# ls /mnt/
      test.out  test2.out  vm_perf.0.0
      
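      The mapping step itself is not shown above; a rough sketch of it, with placeholder names (the real pool and image come from the RBD CSI attributes on the DV's PV), would be:

      # Placeholders only -- look up the Ceph pool and image backing the DV's PVC:
      oc get pv $(oc get pvc <dv-name> -n <plan-namespace> -o jsonpath='{.spec.volumeName}') \
        -o jsonpath='{.spec.csi.volumeAttributes.pool}{"/"}{.spec.csi.volumeAttributes.imageName}{"\n"}'
      # Map it on the worker node (needs a Ceph client keyring on the node), then mount:
      rbd map <pool>/<image-name>
      rbd showmapped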

      [4] Then waited for the first incremental copy to complete. Tried again to mount the disk when the DV was in the paused state. It failed with the following error:

      [root@openshift-worker-deneb-0 ~]# mount /dev/rbd3 /mnt/
      mount: /var/mnt: wrong fs type, bad option, bad superblock on /dev/rbd3, missing codepage or helper program, or other error.
      

      Journalctl:

      Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): Mounting V5 Filesystem e1483691-a7e0-489a-aa73-fd469a7fcdd6
      Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): Corruption warning: Metadata has LSN (1:5851) ahead of current LSN (1:5823). Please unmoun>
      Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): log mount/recovery failed: error -22
      Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): log mount failed
      
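      The "Metadata has LSN ... ahead of current LSN" warning means some metadata blocks on the copied disk are newer than the head of the XFS log, i.e. the copied blocks do not come from a single consistent point in time. A read-only check (not part of the original steps, shown only as a way to inspect the damage) would be:

      # -n runs xfs_repair in no-modify mode; it only reports inconsistencies.
      xfs_repair -n /dev/rbd3
      # (xfs_repair -L would zero the log and let the mount succeed, but it simply
      # discards the journal -- not a fix for the underlying migration bug.)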

      So the corruption is introduced during the incremental copy.

      [5] Did the cutover; virt-v2v hit the same error:

      command: mount '-o' 'ro' '/dev/sdb' '/sysroot//'
      [    7.904799] SGI XFS with ACLs, security attributes, scrub, quota, no debug enabled
      [    7.914852] XFS (sdb): Mounting V5 Filesystem e1483691-a7e0-489a-aa73-fd469a7fcdd6
      [    8.027063] XFS (sdb): Corruption warning: Metadata has LSN (1:5905) ahead of current LSN (1:5823). Please unmount and run xfs_repair (>= v4.3) to resolve.
      [    8.031136] XFS (sdb): log mount/recovery failed: error -22
      [    8.033444] XFS (sdb): log mount failed
      

      The VM boots into maintenance mode because it is unable to mount the second disk.

      Note that steps [4] and [5] were only done to understand at which stage the disk gets corrupted, and I made sure the DV was paused while doing them. The issue is reproducible without those steps.
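
      For reference, the paused state can be confirmed from the DataVolume phase; the plan namespace and DV name below are placeholders:

      # Warm-import DataVolumes report the Paused phase between precopy cycles.
      oc get dv -n <plan-namespace>
      oc get dv <dv-name> -n <plan-namespace> -o jsonpath='{.status.phase}{"\n"}'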

      One thing I noticed during the test is that forklift deletes the snapshot immediately after it sets the checkpoints on the DVs. So when the importer pod spawns and calls QueryChangedDiskAreas, it has to wait for the snapshot delete operation to finish:

      I1110 07:20:54.685177       1 vddk-datasource_amd64.go:298] Current VM power state: poweredOff
      <-- Blocked during snapshot delete -->
      I1110 07:24:22.450431       1 vddk-datasource_amd64.go:900] Set disk file name from current snapshot: [nfs-dell-per7525-02] rhel9/rhel9_2.vmdk
      

      As per https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vddk-programming-guide/GUID-C9D1177E-009E-4366-AB11-6CD99EF82B5B.html , shouldn't we delete the snapshot first, before creating another one and starting the download? I am not sure whether this is what causes the problem, though.
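
      To illustrate the suggested ordering, a rough govc sketch (placeholder VM path and snapshot names; this is not how forklift is implemented, only the sequence being proposed):

      # Remove the previous precopy snapshot first and wait for the task to finish...
      govc snapshot.remove -vm "$VM" forklift-precopy-1
      # ...then create the new snapshot, and only after that query the changed
      # disk areas against it and start the incremental download.
      govc snapshot.create -vm "$VM" forklift-precopy-2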

      Version-Release number of selected component (if applicable):

      Migration Toolkit for Virtualization Operator   2.7.3

      OpenShift Virtualization   4.17.0

      Guest is RHEL 9.4

      How reproducible:

      I can consistently reproduce the problem in my test environment with fio running. 

      Actual results:

      XFS filesystem corruption after warm migration of a VM from VMware

      Expected results:

      No corruption

      Attachments:

        1. cnv-51521-vm-662-kzxvt-virt-v2v.log
          4.86 MB
        2. forklift-controller-55f4fdc5df-pwr4b-main-newimage.log
          1.27 MB
        3. forklift-controller-56d6b9d99f-dvp2g-main-reproduce.log
          1.07 MB
        4. forklift-controller-64db57c564-vr9cz-main-newimage-interval2.log
          3.68 MB
        5. forklift-controller-64db57c564-vr9cz-main-newmage-2.log
          1.47 MB
        6. forklift-controller-64db57c564-vr9cz-main-newmage-2-1.log
          1.47 MB
        7. forklift-controller-755f899f54-n9w58-main.log
          40.60 MB
        8. forklift-controller-75bdb44d8c-hl2tl-main.log
          1.72 MB
        9. forklift-controller-75bdb44d8c-zb8j9-main.log
          1.64 MB
        10. forklift-controller-7976889cc4-2bp26.log
          1.52 MB
        11. forklift-controller-7d9c89c656-v4ggr-main.log
          23.62 MB
        12. image-2024-11-12-13-48-49-594.png
          66 kB
        13. image-2024-11-12-14-06-38-091.png
          92 kB
        14. image-2024-11-13-14-22-39-889.png
          71 kB
        15. image-2024-11-13-14-23-15-075.png
          71 kB
        16. image-2024-11-13-14-30-16-193.png
          33 kB
        17. image-2024-11-13-16-29-12-686.png
          237 kB
        18. image-2024-11-13-17-02-46-462.png
          127 kB
        19. image-2024-11-18-11-55-19-887.png
          3 kB
        20. image-2024-11-18-12-06-29-157.png
          3 kB
        21. image-2024-11-19-15-22-06-419.png
          158 kB
        22. image-2024-11-19-15-41-12-854.png
          115 kB
        23. image-2024-11-19-21-34-52-589.png
          162 kB
        24. image-2024-11-21-15-11-06-211.png
          191 kB
        25. image-2024-11-22-12-57-14-721.png
          46 kB
        26. image-2024-11-22-13-00-36-283.png
          154 kB
        27. logs_cutover.tar.gz
          31.30 MB
        28. logs.tar
          2.98 MB
        29. mtv-1656-416-vm-662-pmzdj-virt-v2v.log
          4.85 MB
        30. mtv-1656-416-vm-662-slh8x.log
          4.85 MB
        31. mtv-1656-newimage-2-vm-678-6lqqw-virt-v2v.log
          4.83 MB
        32. mtv-1656-newimage-vm-678-nzcff-virt-v2v.log
          4.82 MB
        33. mtv-1656-vm-662-mdn44-virt-v2v.log
          4.80 MB
        34. mtv-1656-vm-678-46z7t-virt-v2v.log
          4.83 MB
        35. mtv-xfs-rhel8-vm-678-2mq5s-virt-v2v.log
          4.81 MB
