RHEL / RHEL-109222

[RHEL 9.7] Libvirt managed="yes" VM blocks host connection when vfio-pci driver unset back to host


    • Type: Bug
    • Resolution: Done
    • rhel-10.1, rhel-9.7
    • libvirt
    • Critical
    • rhel-virt-core-libvirt-1
    • All

      What were you trying to do that didn't work?

      When using the managed="yes" attribute for an SR-IOV Virtual Function (VF) in a libvirt VM's XML configuration, VM teardown causes a complete and unrecoverable loss of the host's network connectivity.

      Sometimes the failure occurs earlier in the sequence; the trigger is unclear.

      What is the impact of this issue to you?

      After the VM is destroyed, the host machine's network connection is permanently lost, and the machine cannot be returned to Beaker.

      SSH from an outside network is not possible. The host can still be reached via SSH, but only from a machine in the same subnet (e.g. 10.6.8.*), and from there it is not possible to reach the outside.

      Many Jenkins jobs are failing with the error "channel connection lost"; see e.g.:
      https://libvirt-rhos-jenkins-product.hosted.upshift.rdu2.redhat.com/job/libvirt-RHEL-9.7-runtest-aarch64-function-viommu/44/console

      This impacts both aarch64 and x86_64 tests (also *sriov and some virtual_network tests).

       

      Please provide the package NVR for which the bug is seen:

      libvirt libvirt-10.10.0-14.el9.aarch64
      qemu-kvm qemu-kvm-9.1.0-25.el9.aarch64
      kernel kernel-5.14.0-604.el9.aarch64+64k

      How reproducible is this bug?: 90%

      (For some reason it sometimes does not fail, and the machine below works almost without issue:
      ampere-mtsnow-altramax-63.lab.eng.rdu2.redhat.com)

      I was able to perform these operations manually on ampere*63 many times (and also trigger the tests manually), but when I ran the test from Jenkins there, it failed again:
      https://libvirt-rhos-jenkins-product.hosted.upshift.rdu2.redhat.com/job/libvirt-RHEL-9.7-runtest-aarch64-function-viommu/49/

      Steps to Reproduce

       

      1. Enable SR-IOV on a supported network card (e.g., Intel I350) and create one or more VFs on the host:
      echo 4 > /sys/devices/pci0002:00/0002:00:01.0/0002:01:00.0/sriov_numvfs
      2. Bind the VF to the vfio-pci driver:
      sudo driverctl set-override 0002:01:10.0 vfio-pci
      3. Create a VM with an interface configured for PCI passthrough, using type="hostdev" and managed="yes" on the VF (see the attached avo.xml).
      4. Start the VM with virsh start <vm_name>.
      5. On the host, check the driver in use for the VF with lspci -k.
      6. Shut down the VM with virsh destroy <vm_name>.
      7. Return the driver to the host:
      sudo driverctl unset-override 0002:01:10.0
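      The attached avo.xml is not reproduced here, but a minimal sketch of the kind of interface definition the steps describe might look like the following (the PCI address matches the VF 0002:01:10.0 used above; the MAC address is a placeholder):

      ```xml
      <!-- Hypothetical guest-XML fragment: a hostdev-type interface with
           managed='yes', meaning libvirt itself detaches the VF from its host
           driver and binds it to vfio-pci on VM start, then rebinds the host
           driver on teardown. -->
      <interface type='hostdev' managed='yes'>
        <mac address='52:54:00:00:00:01'/>  <!-- placeholder MAC -->
        <source>
          <address type='pci' domain='0x0002' bus='0x01' slot='0x10' function='0x0'/>
        </source>
      </interface>
      ```

      Note that with managed='yes' libvirt performs the vfio-pci (re)binding itself, while the driverctl set-override/unset-override commands in the steps manipulate the same device's driver binding from outside libvirt; the sketch only illustrates the configuration shape, not a fix.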

      Observed Behavior

      • After the VM is destroyed and the driver override is unset, the host machine's network connection is permanently lost.
      • Sometimes the loss happens earlier, e.g. immediately after destroying the guest.

      Expected Behavior

      • The host machine's network connection should remain stable and functional after the VM is destroyed.

      The bug description was created with the help of Gemini.
      An example VM definition is attached.

      Additional info:
      There are no errors in dmesg or journalctl.

        1. avo.xml
          9 kB
          Hana Holoubkova
        2. avocado-vt-vm1.log
          22 kB
          Hana Holoubkova
        3. messages
          563 kB
          Hana Holoubkova
        4. virtqemud.log.gz
          804 kB
          Hana Holoubkova

              Laine Stump (lstump@redhat.com)
              Hana Holoubkova (rh-ee-hholoubk)
              Meina Li
              Votes: 0
              Watchers: 13

              Created:
              Updated:
              Resolved: