- Bug
- Resolution: Done
- Normal
- None
- rhel-10.0.beta
- libvirt-10.5.0-1.el10
- None
- Moderate
- rhel-sst-virtualization
- ssg_virtualization
- 5
- False
- None
- Red Hat Enterprise Linux
- None
- None
- Automated
- x86_64
- None
What were you trying to do that didn't work?
The mlx vfio migration fails after enabling switchover-ack capabilities
Please provide the package NVR for which bug is seen:
host:
6.10.0-0.rc4.11.el10.x86_64
qemu-kvm-9.0.0-2.el10.x86_64
libvirt-10.4.0-1.el10.x86_64
How reproducible:
100%
Steps to reproduce
1. create an MT2910 VF and set up the VF for migration
2. start a Q35 + OVMF VM with a mlx5_vfio_pci VF
3. enable the switchover-ack capabilities
/bin/virsh qemu-monitor-command --hmp rhel10 "migrate_set_capability return-path on"
/bin/virsh qemu-monitor-command --hmp rhel10 "migrate_set_capability switchover-ack on"
/bin/virsh qemu-monitor-command --hmp rhel10 "info migrate_capabilities"
xbzrle: off
rdma-pin-all: off
auto-converge: off
zero-blocks: off
compress: off
events: on
postcopy-ram: off
x-colo: off
release-ram: off
return-path: on
pause-before-switchover: off
multifd: off
dirty-bitmaps: off
postcopy-blocktime: off
late-block-activate: off
x-ignore-shared: off
validate-uuid: off
background-snapshot: off
zero-copy-send: off
postcopy-preempt: off
switchover-ack: on
dirty-limit: off
4. migrate the VM
# /bin/virsh migrate --live --domain rhel10 --desturi qemu+ssh://10.73.212.96/system
5. check the qemu-kvm log on the source host
# cat /var/log/libvirt/qemu/rhel10.log
2024-06-28 07:06:49.381+0000: initiating migration
2024-06-28T07:06:49.596368Z qemu-kvm: failed to save SaveStateEntry with id(name): 3(ram): -5
6. check the qemu-kvm log on the target host
# cat /var/log/libvirt/qemu/rhel10.log
...
2024-06-28T07:06:49.585869Z qemu-kvm: 0000:e1:00.1: Received INIT_DATA_SENT but switchover ack is not used
2024-06-28T07:06:49.585933Z qemu-kvm: error while loading state section id 88(0000:00:02.3:00.0/vfio)
2024-06-28T07:06:49.586415Z qemu-kvm: load of migration failed: Invalid argument
2024-06-28 07:06:50.066+0000: shutting down, reason=crashed
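Step 3 relies on reading the `info migrate_capabilities` output by eye. A minimal sketch of checking it programmatically is shown here, assuming the output is a whitespace-separated sequence of `name: on|off` pairs as in this report. The helper names `parse_migrate_capabilities` and `missing_capabilities` are hypothetical and are not part of libvirt or QEMU. Because the target-side error says switchover ack "is not used", running the same check on the output collected from both hosts may help spot a mismatch; that is an assumption, not a confirmed root cause.

```python
from typing import Dict, Iterable, List


def parse_migrate_capabilities(text: str) -> Dict[str, bool]:
    """Parse HMP 'info migrate_capabilities' output into {name: enabled}.

    Assumes whitespace-separated 'name: on|off' pairs, on one line or many.
    """
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("expected an even number of tokens")
    caps: Dict[str, bool] = {}
    for name, state in zip(tokens[0::2], tokens[1::2]):
        if not name.endswith(":") or state not in ("on", "off"):
            raise ValueError(f"unexpected token pair: {name!r} {state!r}")
        caps[name[:-1]] = state == "on"
    return caps


def missing_capabilities(
    caps: Dict[str, bool],
    required: Iterable[str] = ("return-path", "switchover-ack"),
) -> List[str]:
    """Return the required capabilities that are absent or switched off."""
    return [name for name in required if not caps.get(name, False)]


# Output copied from step 3 of this report.
SAMPLE = (
    "xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off "
    "compress: off events: on postcopy-ram: off x-colo: off release-ram: off "
    "return-path: on pause-before-switchover: off multifd: off "
    "dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off "
    "x-ignore-shared: off validate-uuid: off background-snapshot: off "
    "zero-copy-send: off postcopy-preempt: off switchover-ack: on "
    "dirty-limit: off"
)

if __name__ == "__main__":
    print(missing_capabilities(parse_migrate_capabilities(SAMPLE)))
```

The text to parse would come from the `virsh qemu-monitor-command --hmp` call in step 3; the sketch deliberately does not invoke `virsh` itself.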
Expected results
The mlx vfio migration completes successfully
Actual results
The mlx vfio migration fails
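The failure is identified by the qemu-kvm messages quoted in steps 5 and 6. As a rough sketch, the following scans a libvirt QEMU log for those message fragments. The labels in `SIGNATURES` and the function name `classify` are hypothetical names chosen for illustration; the matched strings are taken verbatim from the logs in this report.

```python
import re
from typing import List, Tuple

# Matches lines such as "2024-06-28T07:06:49.585869Z qemu-kvm: <message>".
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+qemu-kvm:\s+(?P<msg>.*)$")

# Message fragments copied from the source and target logs in this report.
SIGNATURES = {
    "switchover_ack_not_used": "Received INIT_DATA_SENT but switchover ack is not used",
    "section_load_error": "error while loading state section",
    "load_failed": "load of migration failed",
    "save_failed": "failed to save SaveStateEntry",
}


def classify(log_text: str) -> List[Tuple[str, str]]:
    """Return (label, timestamp) for each qemu-kvm line with a known fragment."""
    hits: List[Tuple[str, str]] = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # libvirt's own lines (e.g. "shutting down") are skipped
        for label, fragment in SIGNATURES.items():
            if fragment in match.group("msg"):
                hits.append((label, match.group("ts")))
    return hits


TARGET_LOG = """\
2024-06-28T07:06:49.585869Z qemu-kvm: 0000:e1:00.1: Received INIT_DATA_SENT but switchover ack is not used
2024-06-28T07:06:49.585933Z qemu-kvm: error while loading state section id 88(0000:00:02.3:00.0/vfio)
2024-06-28T07:06:49.586415Z qemu-kvm: load of migration failed: Invalid argument
2024-06-28 07:06:50.066+0000: shutting down, reason=crashed
"""

SOURCE_LOG = """\
2024-06-28 07:06:49.381+0000: initiating migration
2024-06-28T07:06:49.596368Z qemu-kvm: failed to save SaveStateEntry with id(name): 3(ram): -5
"""

if __name__ == "__main__":
    for label, ts in classify(TARGET_LOG) + classify(SOURCE_LOG):
        print(ts, label)
```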
Additional info
(1) The Mellanox CX-7 device I used:
# flint -d 0000:22:00.0 query full
Image type: FS4
FW Version: 28.38.1002
FW Release Date: 3.8.2023
Part Number: MCX75310AAS-HEA_Ax
Description: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE / NDR200 IB (default mode); Single-port OSFP; PCIe 5.0 x16; Crypto Disabled; Secure Boot Enabled;
Product Version: 28.38.1002
Rom Info: type=UEFI version=14.31.20 cpu=AMD64,AARCH64
          type=PXE version=3.7.201 cpu=AMD64
Description: UID GuidsNumber
Base GUID: 946dae03001db182 2
Base MAC: 946dae1db182 2
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000844
Security Attributes: secure-fw
Default Update Method: fw_ctrl
Life cycle: GA SECURED
Secure Boot Capable: Enabled
EFUSE Security Ver: 0
Image Security Ver: 0
Security Ver Program: Manually ; Disabled
Encryption: Enabled
(2) How to create an MT2910 VF and set up the VF for migration
1.1 load the mlx5_vfio_pci module
# modprobe mlx5_vfio_pci
1.2 create VF
# sudo sh -c "echo 0 > /sys/bus/pci/devices/0000:b1:00.0/sriov_numvfs"
# sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:b1:00.0/sriov_numvfs"
1.3 set VF mac
# sudo sh -c "ip link set ens2f0np0 vf 0 mac 52:54:00:01:01:01"
1.4 unbind created VF from driver
# sudo sh -c "echo 0000:b1:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
1.5 set switchdev mode on PF
# sudo sh -c "devlink dev eswitch set pci/0000:b1:00.0 mode switchdev"
# sudo sh -c "devlink dev eswitch show pci/0000:b1:00.0"
pci/0000:b1:00.0: mode switchdev inline-mode none encap-mode basic
1.6 enable VF's migration feature
# sudo sh -c "devlink port function set pci/0000:b1:00.0/1 migratable enable"
# sudo sh -c "devlink port show pci/0000:b1:00.0/1"
…
function:
  hw_addr 52:54:00:01:01:01 roce enable migratable enable
1.7 bind VF to mlx5_vfio_pci driver
# sudo sh -c "echo '15b3 101e' > /sys/bus/pci/drivers/mlx5_vfio_pci/new_id"
# sudo sh -c "echo '15b3 101e' > /sys/bus/pci/drivers/mlx5_vfio_pci/remove_id"
# readlink -f /sys/bus/pci/devices/0000\:b1\:00.2/driver
/sys/bus/pci/drivers/mlx5_vfio_pci
(3) The mlx vfio migration completes successfully without enabling the switchover-ack capability
(4) Since libvirt-8.0, libvirt enables the return-path capability by default during migration.
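The VF preparation in item (2) can be collected into one parameterised command list so the same sequence is reusable on another host. The sketch below only builds the command strings (a dry run) and executes nothing. The `VfSetup` dataclass and `build_commands` function are hypothetical names; the default values are the PCI addresses, interface name, MAC and device ID used in this report, and would need to be adjusted for any other machine. The verification commands from item (2) (`devlink ... show`, `readlink`) are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VfSetup:
    # Defaults are the values used in this report; adjust per host.
    pf_addr: str = "0000:b1:00.0"
    vf_addr: str = "0000:b1:00.2"
    pf_netdev: str = "ens2f0np0"
    vf_index: int = 0
    vf_mac: str = "52:54:00:01:01:01"
    devlink_port: str = "pci/0000:b1:00.0/1"
    pci_id: str = "15b3 101e"


def build_commands(cfg: VfSetup) -> List[str]:
    """Return the shell commands of steps 1.1 to 1.7 without running them."""
    pf_sysfs = f"/sys/bus/pci/devices/{cfg.pf_addr}"
    drivers = "/sys/bus/pci/drivers"
    return [
        "modprobe mlx5_vfio_pci",                                           # 1.1
        f"echo 0 > {pf_sysfs}/sriov_numvfs",                                # 1.2
        f"echo 1 > {pf_sysfs}/sriov_numvfs",                                # 1.2
        f"ip link set {cfg.pf_netdev} vf {cfg.vf_index} mac {cfg.vf_mac}",  # 1.3
        f"echo {cfg.vf_addr} > {drivers}/mlx5_core/unbind",                 # 1.4
        f"devlink dev eswitch set pci/{cfg.pf_addr} mode switchdev",        # 1.5
        f"devlink port function set {cfg.devlink_port} migratable enable",  # 1.6
        f"echo '{cfg.pci_id}' > {drivers}/mlx5_vfio_pci/new_id",            # 1.7
        f"echo '{cfg.pci_id}' > {drivers}/mlx5_vfio_pci/remove_id",         # 1.7
    ]


if __name__ == "__main__":
    for command in build_commands(VfSetup()):
        print(command)
```

Keeping the list as plain strings makes it easy to review before running each entry as root, for example through `sudo sh -c "..."` as in the original steps.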
Additional info:
(1) The automated command to reproduce this issue:
# python3 /home/private_autocase/vfio/vfio_sriov_test.py --feature=vf --domain=$VM --device_name=MT2910-01 --machine_type=q35 --test_list="set_switchover_ack_in_mlx_device_migration"
- clones RHEL-15002 [mlx vfio migration] The migration fails with "Received INIT_DATA_SENT but switchover ack is not used" error (Closed)