RHEL-88141

[NVIDIA Corporation GA100] GPU device cannot be hot-unplugged if the NVIDIA GPU driver is installed


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • rhel-10.1
    • qemu-kvm / Devices
    • rhel-virt-hwe-arm-1
    • ssg_virtualization
    • aarch64

      What were you trying to do that didn't work?

      Hot-unplugging a GPU device from a VM fails when the NVIDIA data center GPU driver is installed in the guest.

      Please provide the package NVR for which the bug is seen:

      Guest:
      Kernel 6.12.0-74.el10.aarch64+64k
      NVIDIA-SMI 570.133.20
      Driver Version: 570.133.20

      Host:
      libvirt-10.10.0-8.1.el10_0.aarch64
      qemu-kvm-9.1.0-15.el10_0.1.aarch64
      Kernel 6.12.0-55.2.1.el10_0.aarch64+64k

      How reproducible is this bug?:

      100%

      Steps to reproduce

      1. Disable nouveau in the guest.
      2. Start a VM with GPU passthrough (or hot-plug a GPU into a running VM).
      3. Install the driver in the guest (follow https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#nvidia-kernel-modules)
        OR:
        1. dnf install -y gcc kernel*headers* kernel*devel*
        2. dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
        3. dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/sbsa/cuda-rhel9.repo
        4. dnf -y module install nvidia-driver:open-dkms
      4. Detach the device (on the host):
        # cat ~/xml_files/hostdev_simple.xml
        <hostdev mode="subsystem" type="pci" managed="yes">
          <source>
            <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
          </source>
        </hostdev>
        
        # virsh detach-device avocado-vt-vm1 ~/xml_files/hostdev_simple.xml
        

      5. Check the guest's dmesg and run 'lspci | grep NVIDIA' in the guest.

      6. Run 'reboot' or 'fwupdmgr get-devices' in the guest.

      Expected results

      No errors appear in the guest's dmesg.
      The command 'lspci | grep NVIDIA' does not list the removed GPU device.
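The two expected results can be checked mechanically. The following is a minimal sketch, not part of the original report: the function name check_unplug is hypothetical, and it assumes the guest's dmesg and lspci output are passed in as text. It looks for the NVRM message quoted under "Actual results" and for any remaining NVIDIA entry in the lspci output.

```shell
#!/bin/sh
# Sketch: decide from guest-side output whether a GPU hot-unplug succeeded.
# Usage: check_unplug DMESG_TEXT LSPCI_TEXT   (hypothetical helper)
# Prints "PASS" or "FAIL: <reason>"; returns 0 on pass, 1 on failure.
check_unplug() {
    dmesg_text=$1
    lspci_text=$2

    # The NVIDIA driver logs this NVRM line when it refuses to release the device.
    if printf '%s\n' "$dmesg_text" |
        grep -q 'NVRM: Attempting to remove device .* with non-zero usage count'; then
        echo "FAIL: NVRM reported a non-zero usage count"
        return 1
    fi

    # After a successful unplug, lspci should no longer list an NVIDIA device.
    if printf '%s\n' "$lspci_text" | grep -q 'NVIDIA'; then
        echo "FAIL: NVIDIA device still listed by lspci"
        return 1
    fi

    echo "PASS"
    return 0
}
```

In the guest, it could be called as check_unplug "$(dmesg)" "$(lspci)" once the detach has been attempted on the host. The check assumes the unplugged GPU is the only NVIDIA device in the guest.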
       

      Actual results

      The device is not removed. The NVIDIA driver reports a non-zero usage count and the GPU is still listed in the guest.

      [root@localhost ~]# [  943.369462] pcieport 0000:00:01.7: pciehp: Slot(0-7): Button press: will power off in 5 sec
      [  948.412415] NVRM: Attempting to remove device 0000:08:00.0 with non-zero usage count!
      
      [root@localhost ~]# lspci |grep 3D
      08:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
      [root@localhost ~]# 
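As a rough first check when the NVRM "non-zero usage count" message appears, the guest's loaded nvidia* kernel modules and their reference counts can be listed. The sketch below is an illustration only and does not identify the root cause: the function name nvidia_module_refs is hypothetical, and the module reference count printed by lsmod is only a proxy for the per-device usage count that NVRM tracks.

```shell
#!/bin/sh
# Sketch: list nvidia* kernel modules whose "Used by" count is non-zero.
# Reads lsmod-style text on stdin: <module> <size> <use count> [users]
nvidia_module_refs() {
    # NR > 1 skips the "Module Size Used by" header line.
    awk 'NR > 1 && $1 ~ /^nvidia/ && $3 + 0 > 0 { print $1, $3 }'
}
```

In the guest it would be run as lsmod | nvidia_module_refs. Each output line names a module followed by its reference count.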
      

      The reboot does not complete, and the guest logs the following hung-task error:

      [ 1595.375611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1595.376673] task:reboot          state:D stack:0     pid:24876 tgid:24876 ppid:1      task_flags:0x400100 flags:0x00000204
      [ 1595.378017] Call trace:
      [ 1595.378597]  __switch_to+0xec/0x148
      [ 1595.379249]  __schedule+0x244/0x648
      [ 1595.379897]  schedule+0x3c/0xe0
      [ 1595.380478]  schedule_preempt_disabled+0x2c/0x50
      [ 1595.381128]  __mutex_lock.constprop.0+0x444/0x920
      [ 1595.381749]  __mutex_lock_slowpath+0x1c/0x30
      [ 1595.382345]  mutex_lock+0x6c/0x90
      [ 1595.382846]  device_shutdown+0xc0/0x238
      [ 1595.383382]  kernel_restart+0x48/0xb8
      [ 1595.383907]  __do_sys_reboot+0x178/0x278
      [ 1595.384450]  __arm64_sys_reboot+0x2c/0x40
      [ 1595.384987]  invoke_syscall.constprop.0+0x74/0xd0
      [ 1595.385571]  do_el0_svc+0xb0/0xe8
      [ 1595.386052]  el0_svc+0x44/0x1d0
      [ 1595.386520]  el0t_64_sync_handler+0x120/0x130
      [ 1595.387076]  el0t_64_sync+0x1a4/0x1a8
      [ 1718.254457] INFO: task reboot:24876 blocked for more than 614 seconds.
      [ 1718.256019]       Tainted: G           OE     -------  ---  6.12.0-74.el10.aarch64+64k #1
      
      

      Additional info:
      The GPU device can be hot-unplugged successfully when the NVIDIA driver is not installed in the guest.

              ddutile Donald Dutile
              yicui1 Yingshun Cui
              virt-maint virt-maint