RHEL-67996

[vfio migration][aarch64][4k] "qemu-kvm: error while loading state section id 55" reported when migrating a vm with mlx5_vfio_pci VF

    • Bug
    • Resolution: Done
    • Undefined
    • None
    • rhel-9.6
    • qemu-kvm
    • rhel-sst-virtualization
    • ssg_virtualization
    • 3
    • False
    • None
    • None
    • None
    • None
    • aarch64
    • None

      What were you trying to do that didn't work?

      Migrating a VM with an mlx5_vfio_pci VF fails with the following error:

      Migration: [ 0.43 %]error: internal error: QEMU unexpectedly closed the monitor (vm='avocado-vt-vm1'): 2024-11-19T01:19:25.726386Z qemu-kvm: error while loading state section id 55(0000:00:01.0:00.0/vfio)
      2024-11-19T01:19:25.726730Z qemu-kvm: load of migration failed: Invalid argument
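
When triaging a log like the one above, it can help to pull out which migration state section failed to load. The following is a minimal sketch, assuming the QEMU message keeps the "error while loading state section id N(NAME)" shape shown in this report; the function name failed_sections is hypothetical.

```shell
# Read QEMU/libvirt log text on stdin and print the id and name of every
# state section that failed to load on the destination.
# Assumes the message format seen in this report.
failed_sections() {
    sed -n 's/.*error while loading state section id \([0-9][0-9]*\)(\([^)]*\)).*/id=\1 name=\2/p'
}
```

For example, piping the destination domain log (or the virsh error text) through failed_sections should report section 55 with the vfio device name for the failure above.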
      

      Please provide the package NVR for which the bug is seen:

      libvirt-10.9.0-1.el9.aarch64
      qemu-kvm-9.1.0-2.el9.aarch64
      edk2-aarch64-20240524-9.el9.noarch
      kernel-5.14.0-528.el9.aarch64
      source host: nvidia-grace-hopper-06
      source host's FW version: 28.43.1014
      source iface: 0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
      destination host: nvidia-grace-hopper-09
      destination host's FW version: 28.39.1002
      destination iface: 0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
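
The two hosts listed above run different ConnectX-7 firmware versions. Whether that difference contributes to the failure is not established in this report, but it is easy to record and compare. The sketch below assumes the version is read from `ethtool -i <iface>` output and that the PF interface is enp1s0f1np1, as in the reproduction steps; the helper names fw_version and fw_compare are hypothetical.

```shell
# Extract the firmware version from `ethtool -i <iface>` output on stdin.
# Only the first word of the field is printed, so a trailing board id
# in parentheses (if present) is dropped.
fw_version() {
    awk -F': ' '$1 == "firmware-version" { split($2, a, " "); print a[1]; exit }'
}

# Compare the firmware strings collected from the two hosts.
fw_compare() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: source=$1 destination=$2"
    fi
}

# Example, run on each host:
#   ethtool -i enp1s0f1np1 | fw_version
```

With the values from this report, fw_compare 28.43.1014 28.39.1002 reports a mismatch.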

      How reproducible is this bug?:

      100%

      Steps to reproduce

      1. Enable 2 VFs and set them as migratable on both the source and destination hosts (refer to Polarion case VIRT-299412):
        mlxconfig -d 0000:01:00.1  query VF_MIGRATION_MODE
        echo 2 > /sys/devices/pci0000:00/0000:00:00.0/0000:01:00.1/sriov_numvfs
        ip link set enp1s0f1np1 vf 0 mac 52:54:00:01:01:01
        echo 0000:01:02.2 > /sys/bus/pci/drivers/mlx5_core/unbind
        echo 0000:01:02.3 > /sys/bus/pci/drivers/mlx5_core/unbind
        devlink dev eswitch set pci/0000:01:00.1 mode switchdev
        devlink dev eswitch show pci/0000:01:00.1
        devlink port
        devlink port function set pci/0000:01:00.1/65537 migratable enable 
        devlink port function set pci/0000:01:00.1/65538 migratable enable 
        devlink port
        modprobe mlx5_vfio_pci
        virsh nodedev-detach pci_0000_01_02_2 --driver mlx5_vfio_pci
        virsh nodedev-detach pci_0000_01_02_3 --driver mlx5_vfio_pci
        
      2. Configure a VM with an mlx5_vfio_pci VF:
            <hostdev mode='subsystem' type='pci' managed='yes'>
              <driver name='vfio'/>
              <source>
                <address domain='0x0000' bus='0x01' slot='0x02' function='0x2'/>
              </source>
              <alias name='ua-1bcbabff-f022-4d4f-ae8c-13f2d3a07906'/>
              <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
            </hostdev>
        
      3. virsh start <vm>
      4. virsh migrate --live --verbose --domain <vm> --desturi qemu+tcp://<dest ip>/system
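
Before running the migration in step 4, it may be worth confirming on both hosts that the VFs from step 1 ended up bound to mlx5_vfio_pci rather than vfio-pci or mlx5_core. The sketch below reads the standard sysfs driver symlink; the function name pci_driver and the SYSFS_ROOT override (used only so the function can be exercised against a fake tree) are hypothetical.

```shell
# Print the name of the driver a PCI device is bound to, or "none" if
# the device has no driver. $1 is the PCI address, e.g. 0000:01:02.2.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_driver() {
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example:
#   pci_driver 0000:01:02.2
#   pci_driver 0000:01:02.3
```

Both calls are expected to print mlx5_vfio_pci once the nodedev-detach commands in step 1 have succeeded.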

      Expected results

      VM should be migrated to destination host.

      Actual results

      The migration fails with an error:

      # virsh migrate --live --verbose --domain avocado-vt-vm1 --desturi qemu+tcp://10.26.1.121/system
      Migration: [ 0.43 %]error: internal error: QEMU unexpectedly closed the monitor (vm='avocado-vt-vm1'): 2024-11-19T01:19:25.726386Z qemu-kvm: error while loading state section id 55(0000:00:01.0:00.0/vfio)
      
      

      tail -f avocado-vt-vm1.log:

      2024-11-19T02:14:25.206085Z qemu-kvm: failed to save SaveStateEntry with id(name): 3(ram): -5
      2024-11-19T02:14:25.268250Z qemu-kvm: Unable to shutdown socket: Transport endpoint is not connected
      2024-11-19T02:14:25.268276Z qemu-kvm: Sibling indicated error 1
      

      dmesg on destination host:

      [ 4429.528363] mlx5_vfio_pci 0000:01:02.2: enabling device (0000 -> 0002)
      [ 4429.998017] mlx5_core 0000:01:00.1: mlx5_cmd_out_err:808:(pid 28296): LOAD_VHCA_STATE(0x119) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xe9ecae), err(-22)
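
The destination kernel log shows the firmware rejecting the LOAD_VHCA_STATE command. To collect the status, syndrome, and errno from such messages for a report, something like the following sketch may be used. It assumes each kernel message sits on a single line, as dmesg normally prints it, and that the mlx5_cmd_out_err format matches the one shown here; the function name vhca_load_failures is hypothetical.

```shell
# Read kernel log text on stdin and print one summary line per failed
# LOAD_VHCA_STATE firmware command. Other messages are ignored.
vhca_load_failures() {
    sed -n 's/.*LOAD_VHCA_STATE.* failed, status \(.*\), syndrome (\(0x[0-9a-fA-F]*\)), err(\(-*[0-9]*\)).*/status=\1 syndrome=\2 err=\3/p'
}

# Example, run on the destination host after a failed migration:
#   dmesg | vhca_load_failures
```

The syndrome value is firmware-specific, so it is mainly useful when handed to the device vendor together with the firmware versions of both hosts.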
      

              virt-maint virt-maint
              yicui1 Yingshun Cui
              virt-maint virt-maint
              virt-bugs virt-bugs