- Bug
- Resolution: Done
- Normal
- None
- rhel-10.0.beta
- libvirt-10.5.0-1.el10
- None
- Moderate
- rhel-virt-core-libvirt-1
- ssg_virtualization
- 5
- False
- False
- None
- Red Hat Enterprise Linux
- None
- None
- Automated
- x86_64
- None
What were you trying to do that didn't work?
The mlx vfio migration fails after the switchover-ack capability is enabled
Please provide the package NVR for which bug is seen:
host:
6.10.0-0.rc4.11.el10.x86_64
qemu-kvm-9.0.0-2.el10.x86_64
libvirt-10.4.0-1.el10.x86_64
How reproducible:
100%
Steps to reproduce
1. create an MT2910 VF and set up the VF for migration
2. start a Q35 + OVMF VM with an mlx5_vfio_pci VF
3. enable the return-path and switchover-ack capabilities
/bin/virsh qemu-monitor-command --hmp rhel10 "migrate_set_capability return-path on"
/bin/virsh qemu-monitor-command --hmp rhel10 "migrate_set_capability switchover-ack on"
/bin/virsh qemu-monitor-command --hmp rhel10 "info migrate_capabilities"
xbzrle: off
rdma-pin-all: off
auto-converge: off
zero-blocks: off
compress: off
events: on
postcopy-ram: off
x-colo: off
release-ram: off
return-path: on
pause-before-switchover: off
multifd: off
dirty-bitmaps: off
postcopy-blocktime: off
late-block-activate: off
x-ignore-shared: off
validate-uuid: off
background-snapshot: off
zero-copy-send: off
postcopy-preempt: off
switchover-ack: on
dirty-limit: off
4. migrate the VM
# /bin/virsh migrate --live --domain rhel10 --desturi qemu+ssh://10.73.212.96/system
5. check the qemu-kvm log on the source host
# cat /var/log/libvirt/qemu/rhel10.log
2024-06-28 07:06:49.381+0000: initiating migration
2024-06-28T07:06:49.596368Z qemu-kvm: failed to save SaveStateEntry with id(name): 3(ram): -5
6. check the qemu-kvm log on the target host
# cat /var/log/libvirt/qemu/rhel10.log
...
2024-06-28T07:06:49.585869Z qemu-kvm: 0000:e1:00.1: Received INIT_DATA_SENT but switchover ack is not used
2024-06-28T07:06:49.585933Z qemu-kvm: error while loading state section id 88(0000:00:02.3:00.0/vfio)
2024-06-28T07:06:49.586415Z qemu-kvm: load of migration failed: Invalid argument
2024-06-28 07:06:50.066+0000: shutting down, reason=crashed
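The checks in steps 3, 5 and 6 can be scripted. The following is a minimal sketch and not part of the original reproducer: the function names `capability_is_on` and `has_switchover_ack_error` are hypothetical, and it assumes that `info migrate_capabilities` prints one `name: state` pair per line.

```shell
#!/bin/sh
# Hypothetical helpers for scripting the checks in steps 3, 5 and 6.

# capability_is_on NAME
# Reads "info migrate_capabilities" output on stdin and succeeds only if
# a line is exactly "NAME: on". Carriage returns are stripped first, on
# the assumption that the monitor may emit CRLF line endings.
capability_is_on() {
    tr -d '\r' | grep -qxF "$1: on"
}

# has_switchover_ack_error LOGFILE
# Succeeds if the target-side error message from step 6 appears in the
# given QEMU log file.
has_switchover_ack_error() {
    grep -qF "Received INIT_DATA_SENT but switchover ack is not used" "$1"
}

# Example use on the hosts from this report (not executed here):
#   /bin/virsh qemu-monitor-command --hmp rhel10 "info migrate_capabilities" \
#       | capability_is_on switchover-ack && echo "switchover-ack is on"
#   has_switchover_ack_error /var/log/libvirt/qemu/rhel10.log && echo "bug reproduced"
```

Matching the whole line keeps a capability whose name merely ends in the requested name from being reported as enabled.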
Expected results
The mlx vfio migration completes successfully
Actual results
The mlx vfio migration failed
Additional info
(1) The Mellanox CX-7 device I used:
# flint -d 0000:22:00.0 query full
Image type: FS4
FW Version: 28.38.1002
FW Release Date: 3.8.2023
Part Number: MCX75310AAS-HEA_Ax
Description: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE / NDR200 IB (default mode); Single-port OSFP; PCIe 5.0 x16; Crypto Disabled; Secure Boot Enabled;
Product Version: 28.38.1002
Rom Info: type=UEFI version=14.31.20 cpu=AMD64,AARCH64
type=PXE version=3.7.201 cpu=AMD64
Description: UID GuidsNumber
Base GUID: 946dae03001db182 2
Base MAC: 946dae1db182 2
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000844
Security Attributes: secure-fw
Default Update Method: fw_ctrl
Life cycle: GA SECURED
Secure Boot Capable: Enabled
EFUSE Security Ver: 0
Image Security Ver: 0
Security Ver Program: Manually ; Disabled
Encryption: Enabled
(2) How to create an MT2910 VF and set up the VF for migration
1.1 load the mlx5_vfio_pci module
# modprobe mlx5_vfio_pci
1.2 create VF
# sudo sh -c "echo 0 > /sys/bus/pci/devices/0000:b1:00.0/sriov_numvfs"
# sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:b1:00.0/sriov_numvfs"
1.3 set VF mac
# sudo sh -c "ip link set ens2f0np0 vf 0 mac 52:54:00:01:01:01"
1.4 unbind created VF from driver
# sudo sh -c "echo 0000:b1:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
1.5 set switchdev mode on PF
# sudo sh -c "devlink dev eswitch set pci/0000:b1:00.0 mode switchdev"
# sudo sh -c "devlink dev eswitch show pci/0000:b1:00.0"
pci/0000:b1:00.0: mode switchdev inline-mode none encap-mode basic
1.6 enable VF's migration feature
# sudo sh -c "devlink port function set pci/0000:b1:00.0/1 migratable enable"
# sudo sh -c "devlink port show pci/0000:b1:00.0/1"
…
function:
hw_addr 52:54:00:01:01:01 roce enable migratable enable
1.7 bind VF to mlx5_vfio_pci driver
# sudo sh -c "echo '15b3 101e' > /sys/bus/pci/drivers/mlx5_vfio_pci/new_id"
# sudo sh -c "echo '15b3 101e' > /sys/bus/pci/drivers/mlx5_vfio_pci/remove_id"
# readlink -f /sys/bus/pci/devices/0000\:b1\:00.2/driver
/sys/bus/pci/drivers/mlx5_vfio_pci
(3) The mlx vfio migration completes successfully when the switchover-ack capability is not enabled.
(4) libvirt has enabled the return-path capability by default for migration since libvirt-8.0.
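The VF setup in steps 1.1 to 1.7 of note (2) can be collected into one script. This is a sketch under stated assumptions, not the reporter's tooling: the variables `PF`, `VF`, `NETDEV`, `VF_MAC`, `VF_ID` and `DRY_RUN` and the functions `run` and `setup_vf` are hypothetical names, the defaults are the values used in this report, and the devlink port index `1` is copied from the report as-is. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of steps 1.1 to 1.7 from note (2). All variable and function
# names are hypothetical; defaults are the values shown in this report.
PF=${PF:-0000:b1:00.0}                # PCI address of the physical function
VF=${VF:-0000:b1:00.2}                # PCI address of the created VF
NETDEV=${NETDEV:-ens2f0np0}           # PF network interface name
VF_MAC=${VF_MAC:-52:54:00:01:01:01}   # MAC address assigned to VF 0
VF_ID=${VF_ID:-15b3 101e}             # vendor and device ID of the VF
DRY_RUN=${DRY_RUN:-1}                 # 1 = print only, 0 = execute (needs root)

# run CMD...: print the command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@"
    fi
}

setup_vf() {
    run modprobe mlx5_vfio_pci                                              # 1.1
    run sh -c "echo 0 > /sys/bus/pci/devices/$PF/sriov_numvfs"              # 1.2
    run sh -c "echo 1 > /sys/bus/pci/devices/$PF/sriov_numvfs"
    run ip link set "$NETDEV" vf 0 mac "$VF_MAC"                            # 1.3
    run sh -c "echo $VF > /sys/bus/pci/drivers/mlx5_core/unbind"            # 1.4
    run devlink dev eswitch set "pci/$PF" mode switchdev                    # 1.5
    run devlink port function set "pci/$PF/1" migratable enable             # 1.6
    run sh -c "echo '$VF_ID' > /sys/bus/pci/drivers/mlx5_vfio_pci/new_id"   # 1.7
    run sh -c "echo '$VF_ID' > /sys/bus/pci/drivers/mlx5_vfio_pci/remove_id"
    run readlink -f "/sys/bus/pci/devices/$VF/driver"
}
```

Calling `setup_vf` with the default `DRY_RUN=1` prints each command prefixed with `+`, which allows the addresses to be reviewed before setting `DRY_RUN=0` and running it as root.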
Additional info:
(1) The automated command to reproduce this issue:
# python3 /home/private_autocase/vfio/vfio_sriov_test.py --feature=vf --domain=$VM --device_name=MT2910-01 --machine_type=q35 --test_list="set_switchover_ack_in_mlx_device_migration"
- clones
- RHEL-15002 [mlx vfio migration] The migration fails with "Received INIT_DATA_SENT but switchover ack is not used" error
- Closed