
[vfio migration][seabios-1.16.3-1.el9] The Q35 + SEABIOS VM with a mlx VF can not be migrated

    • qemu-kvm-8.2.0-1.el9
    • Critical
    • TestOnly, Regression
    • rhel-sst-virtualization
    • ssg_virtualization
    • QE ack
    • Red Hat Enterprise Linux
    • x86_64

      What were you trying to do that didn't work?
      The Q35 + SEABIOS VM with a mlx5_vfio_pci VF cannot be migrated when using seabios-1.16.3-1.el9

      Version-Release number of selected component (if applicable):
      host:
      5.14.0-402.el9.x86_64
      qemu-kvm-8.1.0-4.el9.x86_64
      libvirt-9.9.0-1.el9.x86_64
      seabios-1.16.3-1.el9
      VM:
      5.14.0-402.el9.x86_64

      How reproducible:
      100%

      Steps to reproduce
      1. create an MT2910 VF and set up the VF for migration
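
      (For reference, a minimal sketch of this step; the PF netdev name ens3f1np1, the PF address 0000:e1:00.0 and the devlink port index are placeholders, and the exact prerequisites for migration support depend on the NIC firmware and kernel:)

      # echo 1 > /sys/class/net/ens3f1np1/device/sriov_numvfs                  ← create one VF on the PF
      # echo 0000:e1:00.1 > /sys/bus/pci/devices/0000:e1:00.1/driver/unbind    ← detach the VF from mlx5_core
      # devlink port function set pci/0000:e1:00.0/1 migratable enable         ← if supported/required, mark the VF as migratable
      # echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:e1:00.1/driver_override
      # echo 0000:e1:00.1 > /sys/bus/pci/drivers_probe                         ← bind the VF to the migration-capable variant driver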

      2. start a Q35 + SEABIOS VM with a mlx5_vfio_pci VF

      3. check the MT2910 VF in the VM
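
      (Inside the guest, a quick sanity check could be, e.g.:)

      # lspci -nn | grep -i mellanox   ← the VF should be visible as a Mellanox PCI function
      # ip link show                   ← and as a network interface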

      4. migrate the VM

      # virsh migrate --live --verbose --domain rhel94 --desturi qemu+ssh://10.73.212.98/system
      

      The migration process is hung

      related qmp:

      > {"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-448"} (fd=21)
      <  {"return": {}, "id": "libvirt-448"}
      > {"execute":"migrate","arguments":{"detach":true,"resume":false,"uri":"fd:migrate"},"id":"libvirt-449"}
      !  {"timestamp": {"seconds": 1703741835, "microseconds": 924774}, "event": "MIGRATION", "data": {"status": "setup"}}
      <  {"return": {}, "id": "libvirt-449"}
      !  {"timestamp": {"seconds": 1703741835, "microseconds": 931893}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
      !  {"timestamp": {"seconds": 1703741835, "microseconds": 965797}, "event": "MIGRATION", "data": {"status": "active"}}
      !  {"timestamp": {"seconds": 1703741835, "microseconds": 965844}, "event": "MIGRATION", "data": {"status": "failed"}}
      > {"execute":"query-migrate","id":"libvirt-450"} <-- The migration process is hung here
      

      5. destroy the VM and then check the qemu-kvm log
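
      (Roughly, using the domain name from step 4 and the default libvirt per-domain log path:)

      # virsh destroy rhel94
      # tail /var/log/libvirt/qemu/rhel94.log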

      ...
      2023-12-28 05:08:46.219+0000: initiating migration
      2023-12-28T05:08:46.226485Z qemu-kvm: 0000:e1:00.1: Failed to start DMA logging, err -95 (Operation not supported)
      2023-12-28T05:08:46.226653Z qemu-kvm: vfio: Could not start dirty page tracking, err: -95 (Operation not supported)
      2023-12-28 05:54:02.734+0000: shutting down, reason=destroyed
      

      Expected results
      The Q35 + SEABIOS VM with a mlx VF can be migrated

      Actual results
      The Q35 + SEABIOS VM with a mlx VF cannot be migrated

            [RHEL-20284] [vfio migration][seabios-1.16.3-1.el9] The Q35 + SEABIOS VM with a mlx VF can not be migrated

            RHEL Jira bot added a comment - The 'blocked by' issue RHEL-7098 is transitioned to Closed.

            YangHang Liu added a comment -

            rh-ee-clegoate Thanks for the confirmation : )

            I have updated the keyword and label to TestOnly.

            Cédric Le Goater added a comment -

            > I plan to keep this issue open to track my Q35 SEABIOS VM + mlx vfio migration test.
            > (It's a regression issue after seabios-bin-1.16.3-1 but is likely to have the same root cause as the edk2-ovmf issue in RHEL-7098.)

            Yes, both issues are related to the dynamic PCI MMIO window change in the firmware.

            > May I ask your opinion on how we can deal with this issue:
            > [1] move this issue to "In Progress" and request "Preliminary Testing"
            > [2] mark this issue as TestOnly and move it to "Closed" directly once my verification finishes

            I would choose [2], since the fix is already addressed in RHEL-7098.

            YangHang Liu added a comment -

            rh-ee-clegoate

            I plan to keep this issue open to track my Q35 SEABIOS VM + mlx vfio migration test.
            (It's a regression issue after seabios-bin-1.16.3-1 but is likely to have the same root cause as the edk2-ovmf issue in RHEL-7098.)

            May I ask your opinion on how we can deal with this issue:
            [1] move this issue to "In Progress" and request "Preliminary Testing"
            [2] mark this issue as TestOnly and move it to "Closed" directly once my verification finishes

            YangHang Liu added a comment - edited

            Test env:
            host:
            source: dell-per7625-01.lab.eng.pek2.redhat.com
            target: dell-per7625-02.lab.eng.pek2.redhat.com
            5.14.0-402.el9.x86_64
            qemu-kvm-8.2.0-1.el9.x86_64
            seabios-bin-1.16.3-1.el9.noarch
            VM
            5.14.0-402.el9.x86_64

            Test result: PASS

            Test step:

            1. create MT2910 VFs and set up the VFs for vfio migration on the source host

            2. start a Q35 + SEABIOS VM with two mlx VFs on the source host

                <os>
                  <type arch='x86_64' machine='pc-q35-rhel9.2.0'>hvm</type>
                  <boot dev='hd'/>
                </os>
                <hostdev mode='subsystem' type='pci' managed='no'>
                  <driver name='vfio'/>
                  <source>
                    <address domain='0x0000' bus='0xe1' slot='0x00' function='0x1'/>
                  </source>
                </hostdev>
                <hostdev mode='subsystem' type='pci' managed='no'>
                  <driver name='vfio'/>
                  <source>
                    <address domain='0x0000' bus='0xe1' slot='0x00' function='0x1'/>
                  </source>
                </hostdev>
            

            3. configure an IP address for the mlx VF

            # ifconfig enp4s0 192.168.150.100
            

            4. migrate a VM into a file

            # /bin/virsh managedsave  rhel94 --verbose
            Managedsave: [100.00 %]
            
            # /bin/virsh domjobinfo rhel94 --completed
            Job type:         Completed
            Operation:        Save
            Time elapsed:     648          ms
            Data processed:   524.519 MiB
            Data remaining:   0.000 B
            Data total:       4.016 GiB
            Memory processed: 524.519 MiB
            Memory remaining: 0.000 B
            Memory total:     4.016 GiB
            Memory bandwidth: 1022.455 MiB/s
            Dirty rate:       0            pages/s
            Page size:        4096         bytes
            Iteration:        3
            Postcopy requests: 0
            Constant pages:   923293
            Normal pages:     129589
            Normal data:      506.207 MiB
            Total downtime:   591          ms
            Setup time:       67           ms
            
            

            5. restore a VM from a file and check the mlx VF IP address
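
            (The restore is simply starting the domain again; virsh start picks up the managed save image created in step 4 automatically:)

            # virsh start rhel94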

            # ifconfig
            enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
                    inet 192.168.150.100  netmask 255.255.255.0  broadcast 192.168.150.255
                    inet6 fe80::c7ef:5ce:ac9f:d644  prefixlen 64  scopeid 0x20<link>
                    ether 52:54:00:35:11:cf  txqueuelen 1000  (Ethernet)
                    RX packets 3  bytes 1005 (1005.0 B)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 19  bytes 3362 (3.2 KiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
            
            enp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
                    inet6 fe80::d8a2:581:83ac:efa  prefixlen 64  scopeid 0x20<link>
                    ether 52:54:00:5d:88:1f  txqueuelen 1000  (Ethernet)
                    RX packets 3  bytes 1005 (1005.0 B)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 22  bytes 3860 (3.7 KiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
            


            YangHang Liu added a comment - edited

            Test env:
            host:
            source: dell-per7625-01.lab.eng.pek2.redhat.com
            target: dell-per7625-02.lab.eng.pek2.redhat.com
            5.14.0-402.el9.x86_64
            qemu-kvm-8.2.0-1.el9.x86_64
            seabios-bin-1.16.3-1.el9.noarch
            VM
            5.14.0-402.el9.x86_64

            Test result: PASS

            Test step:

            1. create an MT2910 VF and set up the VF for vfio migration on the source host

            2. create an MT2910 VF and set up the VF for vfio migration on the target host

            3. start a Q35 + SEABIOS VM with a mlx VF on the source host

            The xml:

              <os>
                <type arch='x86_64' machine='pc-q35-rhel9.4.0'>hvm</type>
                <boot dev='hd'/>
              </os>
               ... 
               <hostdev mode='subsystem' type='pci' managed='no'>
                  <driver name='vfio'/>
                  <source>
                    <address domain='0x0000' bus='0xe1' slot='0x00' function='0x1'/>
                  </source>
                  <alias name='hostdev0'/>
                </hostdev>
            

            4. migrate the VM

            # /bin/virsh migrate --live --verbose --domain rhel94 --desturi qemu+ssh://10.73.212.96/system
            Migration: [100.00 %]
            

            5. check the migration status on the source host

            # tail -f /var/log/libvirt/qemu/rhel94.log
            2024-01-04 13:31:55.601+0000: initiating migration
            2024-01-04 13:32:00.851+0000: shutting down, reason=migrated
            
            # /bin/virsh domjobinfo rhel94 --completed
            Job type:         Completed
            Operation:        Outgoing migration
            Time elapsed:     6398         ms
            Data processed:   541.504 MiB
            Data remaining:   0.000 B
            Data total:       4.016 GiB
            Memory processed: 541.504 MiB
            Memory remaining: 0.000 B
            Memory total:     4.016 GiB
            Memory bandwidth: 107.977 MiB/s
            Dirty rate:       0            pages/s
            Page size:        4096         bytes
            Iteration:        5
            Postcopy requests: 0
            Constant pages:   918913
            Normal pages:     135074
            Normal data:      527.633 MiB
            Total downtime:   242          ms
            Setup time:       37           ms
            

            6. migrate the VM back
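
            (For reference, migrating back is the same virsh migrate invocation run from the target host toward the source; the destination host below is a placeholder:)

            # /bin/virsh migrate --live --verbose --domain rhel94 --desturi qemu+ssh://<source-host>/system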

            # /bin/virsh domjobinfo rhel94 --completed
            Job type:         Completed
            Operation:        Incoming migration
            Time elapsed:     6649         ms
            Time elapsed w/o network: 6644         ms
            Data processed:   555.549 MiB
            Data remaining:   0.000 B
            Data total:       4.016 GiB
            Memory processed: 555.549 MiB
            Memory remaining: 0.000 B
            Memory total:     4.016 GiB
            Memory bandwidth: 105.297 MiB/s
            Dirty rate:       0            pages/s
            Page size:        4096         bytes
            Iteration:        4
            Postcopy requests: 0
            Constant pages:   919062
            Normal pages:     138662
            Normal data:      541.648 MiB
            Total downtime:   377          ms
            Downtime w/o network: 372          ms
            Setup time:       44           ms
            

            7. check the VM status on the VM

            # ifconfig or lspci  ← We can get the VF info via ifconfig or lspci
            # dmesg  ← There is no error in the VM dmesg
            


            YangHang Liu added a comment -

            > Related: https://issues.redhat.com/browse/RHEL-7098
            > Very likely the same root cause.

            The qemu-kvm 8.2 build with the fix for RHEL-7098 has come out : )

            I will have a try and update the test result in JIRA this week.

            Gerd Hoffmann added a comment -

            Related: https://issues.redhat.com/browse/RHEL-7098
            Very likely the same root cause.

            Xueqiang Wei added a comment -

            Hi yanghliu@redhat.com,

            Have you used the same phys-bits between the src host and the dst host? Maybe you can retry with the parameters "host-phys-bits=on,host-phys-bits-limit=*". Thanks.
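
            (For reference, on the QEMU command line that suggestion corresponds to something like the example below; the limit value 46 is only an illustration and would typically be set to the smaller of the two hosts' physical address bit widths. Recent libvirt can express the same setting via the <maxphysaddr> sub-element of <cpu>.)

            # /usr/libexec/qemu-kvm ... -cpu host,host-phys-bits=on,host-phys-bits-limit=46 ...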

            YangHang Liu added a comment - edited

            It's a regression introduced in seabios-1.16.3-1.el9 by RHEL-7112 ([seabios] dynamic mmio window).

