RHEL-45460

[rhel10-beta][mlx vfio migration] The migration fails with "Received INIT_DATA_SENT but switchover ack is not used" error

    • libvirt-10.5.0-1.el10
    • None
    • Moderate
    • rhel-sst-virtualization
    • ssg_virtualization
    • 5
    • False
    • None
    • Red Hat Enterprise Linux
    • None
    • x86_64
    • None

      What were you trying to do that didn't work?

      Live migration of a VM with an mlx5_vfio_pci VF fails after the switchover-ack capability is enabled

      Please provide the package NVR for which bug is seen:

      host:
      6.10.0-0.rc4.11.el10.x86_64
      qemu-kvm-9.0.0-2.el10.x86_64
      libvirt-10.4.0-1.el10.x86_64

      How reproducible:

      100%

      Steps to reproduce

      1. create an MT2910 VF and set up the VF for migration

      2. start a Q35 + OVMF VM with an mlx5_vfio_pci VF

      3. enable the switchover-ack capabilities

      /bin/virsh qemu-monitor-command --hmp rhel10 "migrate_set_capability return-path on"
      
      /bin/virsh qemu-monitor-command --hmp rhel10 "migrate_set_capability switchover-ack on"
      
      /bin/virsh qemu-monitor-command --hmp rhel10 "info migrate_capabilities"
      xbzrle: off
      rdma-pin-all: off
      auto-converge: off
      zero-blocks: off
      compress: off
      events: on
      postcopy-ram: off
      x-colo: off
      release-ram: off
      return-path: on
      pause-before-switchover: off
      multifd: off
      dirty-bitmaps: off
      postcopy-blocktime: off
      late-block-activate: off
      x-ignore-shared: off
      validate-uuid: off
      background-snapshot: off
      zero-copy-send: off
      postcopy-preempt: off
      switchover-ack: on
      dirty-limit: off
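
The capability listing above can also be checked by a script instead of by eye. The following is a minimal sketch; `cap_is_on` is a hypothetical helper name (not part of virsh or QEMU), and it assumes the `name: state` line format shown in the output above.

```shell
# Hypothetical helper (not part of virsh or QEMU): reads the text printed by
# "info migrate_capabilities" on stdin and succeeds only if the capability
# named in $1 is listed as "on".
cap_is_on() {
    tr -d '\r' | awk -v cap="$1" '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }   # trim each line
        $0 == (cap ": on") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Possible usage on the source host (domain name "rhel10" as in this report):
#   /bin/virsh qemu-monitor-command --hmp rhel10 "info migrate_capabilities" \
#       | cap_is_on switchover-ack && echo "switchover-ack is on"
```

The helper only inspects whichever QEMU instance the monitor command was sent to, so it says nothing about the other side of the migration.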
      

      4. migrate the VM

      # /bin/virsh migrate --live --domain rhel10 --desturi qemu+ssh://10.73.212.96/system
      

      5. check the qemu-kvm log on the source host

      # cat /var/log/libvirt/qemu/rhel10.log
      2024-06-28 07:06:49.381+0000: initiating migration
      2024-06-28T07:06:49.596368Z qemu-kvm: failed to save SaveStateEntry with id(name): 3(ram): -5
      

      6. check the qemu-kvm log on the target host

      # cat /var/log/libvirt/qemu/rhel10.log
      ...
      2024-06-28T07:06:49.585869Z qemu-kvm: 0000:e1:00.1: Received INIT_DATA_SENT but switchover ack is not used
      2024-06-28T07:06:49.585933Z qemu-kvm: error while loading state section id 88(0000:00:02.3:00.0/vfio)
      2024-06-28T07:06:49.586415Z qemu-kvm: load of migration failed: Invalid argument
      2024-06-28 07:06:50.066+0000: shutting down, reason=crashed
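
To spot this failure signature in a log automatically, a filter along the following lines might help. It is a sketch only: `switchover_ack_error_devices` is a hypothetical name, and the pattern assumes the exact message wording and `qemu-kvm: <PCI address>: ...` layout seen in the target log above.

```shell
# Hypothetical helper: reads a qemu-kvm log on stdin and prints the PCI address
# from each "Received INIT_DATA_SENT but switchover ack is not used" line.
# Lines that do not match the message are ignored.
switchover_ack_error_devices() {
    sed -n 's/.*qemu-kvm: \([0-9a-fA-F:.]*\): Received INIT_DATA_SENT but switchover ack is not used.*/\1/p'
}

# Possible usage on the target host:
#   switchover_ack_error_devices < /var/log/libvirt/qemu/rhel10.log
```

An empty result means the message was not found in the log that was read.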
      

      Expected results

      The mlx vfio migration completes successfully

      Actual results

      The mlx vfio migration fails

      Additional info

      (1) The Mellanox CX-7 device I used:

      # flint -d 0000:22:00.0 query full
      Image type:            FS4
      FW Version:            28.38.1002
      FW Release Date:       3.8.2023
      Part Number:           MCX75310AAS-HEA_Ax
      Description:           NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE / NDR200 IB (default mode); Single-port OSFP; PCIe 5.0 x16; Crypto Disabled; Secure Boot Enabled;
      Product Version:       28.38.1002
      Rom Info:              type=UEFI version=14.31.20 cpu=AMD64,AARCH64
                             type=PXE version=3.7.201 cpu=AMD64
      Description:           UID                GuidsNumber
      Base GUID:             946dae03001db182        2
      Base MAC:              946dae1db182            2
      Image VSD:             N/A
      Device VSD:            N/A
      PSID:                  MT_0000000844
      Security Attributes:   secure-fw
      Default Update Method: fw_ctrl
      Life cycle:            GA SECURED
      Secure Boot Capable:   Enabled
      EFUSE Security Ver:    0
      Image Security Ver:    0
      Security Ver Program:  Manually ; Disabled
      Encryption:            Enabled
      

      (2) How to create an MT2910 VF and set up the VF for migration

      1.1 load the mlx5_vfio_pci module
      
      # modprobe mlx5_vfio_pci
      
      1.2 create VF
      
      # sudo sh -c "echo 0 > /sys/bus/pci/devices/0000:b1:00.0/sriov_numvfs"
      # sudo sh -c "echo 1 > /sys/bus/pci/devices/0000:b1:00.0/sriov_numvfs"
      
      1.3 set VF mac
      
      # sudo sh -c  "ip link set ens2f0np0 vf 0 mac 52:54:00:01:01:01"
      
      1.4 unbind created VF from driver
      
      # sudo sh -c  "echo 0000:b1:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
      
      1.5 set switchdev mode on PF
      
      # sudo sh -c "devlink dev eswitch set pci/0000:b1:00.0 mode switchdev"
      # sudo sh -c "devlink dev eswitch show pci/0000:b1:00.0"
          pci/0000:b1:00.0: mode switchdev inline-mode none encap-mode basic
      
      1.6 enable VF's migration feature
      
      # sudo sh -c "devlink port function set pci/0000:b1:00.0/1 migratable enable"
      # sudo sh -c  "devlink port show pci/0000:b1:00.0/1"
      	  …
        function:
          hw_addr 52:54:00:01:01:01 roce enable migratable enable
      
      1.7 bind VF to mlx5_vfio_pci driver
      
      # sudo sh -c "echo '15b3 101e' > /sys/bus/pci/drivers/mlx5_vfio_pci/new_id"
      # sudo sh -c "echo '15b3 101e' > /sys/bus/pci/drivers/mlx5_vfio_pci/remove_id"
      # readlink -f /sys/bus/pci/devices/0000\:b1\:00.2/driver
        /sys/bus/pci/drivers/mlx5_vfio_pci
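
The two verification points in steps 1.6 and 1.7 can be scripted so a test run stops early if the VF is not ready. This is a minimal sketch: both function names are hypothetical, and the checks assume the `migratable enable` text and the driver path shown in the outputs above.

```shell
# Hypothetical helper: reads "devlink port show <port>" output on stdin and
# succeeds if the port function reports "migratable enable".
port_is_migratable() {
    grep -q 'migratable enable'
}

# Hypothetical helper: $1 is the resolved driver path of the VF (for example
# the output of "readlink -f /sys/bus/pci/devices/<VF>/driver"); succeeds if
# the VF is bound to mlx5_vfio_pci.
vf_uses_vfio_driver() {
    [ "$(basename "$1")" = "mlx5_vfio_pci" ]
}

# Possible usage with the addresses from this report:
#   devlink port show pci/0000:b1:00.0/1 | port_is_migratable || echo "VF is not migratable"
#   vf_uses_vfio_driver "$(readlink -f /sys/bus/pci/devices/0000:b1:00.2/driver)" \
#       || echo "VF is not bound to mlx5_vfio_pci"
```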
      

      (3) The mlx vfio migration completes successfully when the switchover-ack capability is not enabled

      (4) libvirt has enabled the return-path capability by default for migration since libvirt-8.0.

      Additional info:
      (1) The automated test command to reproduce this issue:

      # python3 /home/private_autocase/vfio/vfio_sriov_test.py --feature=vf --domain=$VM --device_name=MT2910-01 --machine_type=q35 --test_list="set_switchover_ack_in_mlx_device_migration"
      

              virt-maint virt-maint
              yanghliu@redhat.com YangHang Liu
              virt-maint virt-maint
              virt-bugs virt-bugs