Bug
Resolution: Done-Errata
Critical
Description of problem:
The destination VM exhibits XFS filesystem corruption after migrating a VMware VM to OpenShift Virtualization using MTV.
The following test steps were performed:
[1] The VM has 2 disks; the 2nd disk is mounted on /test.
Started fio with target /test on the VMware VM:
fio --rw=randrw --ioengine=libaio --directory=/test/ --size=10G --bs=4k --name=vm_perf --runtime=12000 --time_based --iodepth=512 --group_reporting
The runtime is kept high so that fio is always running during the migration.
[2] Reduced the “controller_precopy_interval” to 10 minutes for faster precopies (see the patch sketch below).
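For reference, lowering the interval can be done with a patch roughly like the following; the ForkliftController instance name and namespace are assumptions (defaults shown), and the interval value is in minutes:
oc patch forkliftcontroller/forklift-controller -n openshift-mtv --type merge -p '{"spec": {"controller_precopy_interval": 10}}'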
[3] Started a warm migration (see the sketch below).
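A plan is made warm by setting spec.warm on the Plan resource; a minimal sketch, assuming a hypothetical plan name:
oc patch plan/rhel9-plan -n openshift-mtv --type merge -p '{"spec": {"warm": true}}'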
After the initial full download of the virtual disks, while the DVs were in the paused state, I manually mapped and mounted the rbd device from the worker node, which succeeded (the map step is sketched below):
[root@openshift-worker-deneb-0 ~]# mount /dev/rbd3 /mnt/
[root@openshift-worker-deneb-0 ~]# ls /mnt/
test.out  test2.out  vm_perf.0.0
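The preceding map step was along these lines; the pool and image names are hypothetical, and the image backing the DV's PV can be read from the PV's CSI volume handle:
rbd map ocs-storagecluster-cephblockpool/csi-vol-<uuid> --id admin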
[4] Then waited for the first incremental copy to complete and tried again to mount the disk while the DV was in the paused state. It failed with the following error:
[root@openshift-worker-deneb-0 ~]# mount /dev/rbd3 /mnt/
mount: /var/mnt: wrong fs type, bad option, bad superblock on /dev/rbd3, missing codepage or helper program, or other error.
Journalctl:
Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): Mounting V5 Filesystem e1483691-a7e0-489a-aa73-fd469a7fcdd6
Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): Corruption warning: Metadata has LSN (1:5851) ahead of current LSN (1:5823). Please unmount and run xfs_repair (>= v4.3) to resolve.
Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): log mount/recovery failed: error -22
Nov 10 07:16:50 openshift-worker-deneb-0 kernel: XFS (rbd3): log mount failed
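As an aside, the LSN mismatch can be confirmed without modifying the device, for example with a dry-run repair or a mount that skips XFS log recovery:
xfs_repair -n /dev/rbd3
mount -o ro,norecovery /dev/rbd3 /mnt/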
So the corruption is introduced during the incremental copy.
[5] Performed the cutover; virt-v2v hit the same error:
command: mount '-o' 'ro' '/dev/sdb' '/sysroot//'
[ 7.904799] SGI XFS with ACLs, security attributes, scrub, quota, no debug enabled
[ 7.914852] XFS (sdb): Mounting V5 Filesystem e1483691-a7e0-489a-aa73-fd469a7fcdd6
[ 8.027063] XFS (sdb): Corruption warning: Metadata has LSN (1:5905) ahead of current LSN (1:5823). Please unmount and run xfs_repair (>= v4.3) to resolve.
[ 8.031136] XFS (sdb): log mount/recovery failed: error -22
[ 8.033444] XFS (sdb): log mount failed
The VM boots into maintenance mode because it is unable to mount the second disk.
Note that steps [4] and [5] were done to identify the stage at which the disk gets corrupted, and the DV was verified to be paused while doing so. The issue is reproducible without those steps.
One thing I noticed during the test is that forklift deletes the snapshot immediately after it sets checkpoints on the DVs. So when the importer pod spawns and calls QueryChangedDiskAreas, it has to wait for the snapshot delete operation to finish.
I1110 07:20:54.685177 1 vddk-datasource_amd64.go:298] Current VM power state: poweredOff   <-- blocked here during snapshot delete -->
I1110 07:24:22.450431 1 vddk-datasource_amd64.go:900] Set disk file name from current snapshot: [nfs-dell-per7525-02] rhel9/rhel9_2.vmdk
As per https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vddk-programming-guide/GUID-C9D1177E-009E-4366-AB11-6CD99EF82B5B.html , shouldn't we delete the old snapshot before creating another one and starting the download? I am not sure whether this is what causes the problem, though (see the ordering sketch below).
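For illustration only, the ordering being questioned could be sketched with govc; the VM and snapshot names below are hypothetical:
govc snapshot.remove -vm rhel9 precopy-1   # delete the previous snapshot first
govc snapshot.create -vm rhel9 precopy-2   # then create the next one, query changed areas against the recorded changeId, and download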
Version-Release number of selected component (if applicable):
Migration Toolkit for Virtualization Operator 2.7.3
OpenShift Virtualization 4.17.0
Guest is RHEL 9.4
How reproducible:
I can consistently reproduce the problem in my test environment with fio running.
Actual results:
XFS filesystem corruption after warm migration of VM from VMware
Expected results:
No corruption
is related to:
- MTV-1681 [SPIKE] Investigate disabled snapshot consolidation (New)
relates to:
- MTV-1679 XFS filesystem corruption after warm migration of VM from VMware (too many changed blocks) (Closed)
split to:
- MTV-1664 Warm migration with pre-existing snapshot will not work (New)
- MTV-1679 XFS filesystem corruption after warm migration of VM from VMware (too many changed blocks) (Closed)
links to:
- RHBA-2024:142279 MTV 2.7.4 Images