-
Bug
-
Resolution: Done
-
Major
-
None
-
rhel-10.1
-
rhel-virt-hwe-arm-1
-
ssg_virtualization
-
0
-
False
-
False
-
-
None
-
Split items
-
None
-
None
-
Unspecified
-
Unspecified
-
Unspecified
-
-
aarch64
-
None
What were you trying to do that didn't work?
Hot-unplugging a GPU device from a VM fails when the NVIDIA data center GPU driver is installed in the guest.
Please provide the package NVR for which the bug is seen:
Guest:
Kernel 6.12.0-74.el10.aarch64+64k
NVIDIA-SMI 570.133.20
Driver Version: 570.133.20
Host:
libvirt-10.10.0-8.1.el10_0.aarch64
qemu-kvm-9.1.0-15.el10_0.1.aarch64
Kernel 6.12.0-55.2.1.el10_0.aarch64+64k
How reproducible is this bug?:
100%
Steps to reproduce
- Disable nouveau in the guest
- Start a VM with GPU passthrough (or hotplug a GPU into a running VM)
- Install the driver (follow https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#nvidia-kernel-modules)
OR:
- dnf install -y gcc kernel*headers* kernel*devel*
- dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
- dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/sbsa/cuda-rhel9.repo
- dnf -y module install nvidia-driver:open-dkms
- Detach the device
# cat ~/xml_files/hostdev_simple.xml
<hostdev mode="subsystem" type="pci" managed="yes">
  <source>
    <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
  </source>
</hostdev>
# virsh detach-device avocado-vt-vm1 ~/xml_files/hostdev_simple.xml
- Check the guest's dmesg and run 'lspci | grep NVIDIA' in the guest
- Run 'reboot' or 'fwupdmgr get-devices' in the guest
Expected results
There is no error in the guest's dmesg.
The command 'lspci | grep NVIDIA' does not list the removed GPU device.
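The two checks above can be sketched as a small script. This is only an illustration, assuming the guest's dmesg and lspci output have already been captured as text. The function names (`busy_devices`, `nvidia_devices`, `unplug_succeeded`) are hypothetical, and the regular expression is modelled on the single NVRM message seen in this report, so other driver versions may word it differently.

```python
import re
from typing import List

# Assumption: the message format matches the NVRM line observed in this report.
NVRM_BUSY = re.compile(
    r"NVRM: Attempting to remove device (\S+) with non-zero usage count"
)


def busy_devices(dmesg_text: str) -> List[str]:
    """Return PCI addresses that the NVIDIA module refused to release."""
    return NVRM_BUSY.findall(dmesg_text)


def nvidia_devices(lspci_text: str) -> List[str]:
    """Return the slot (first field) of every lspci line mentioning NVIDIA."""
    return [
        line.split()[0]
        for line in lspci_text.splitlines()
        if "NVIDIA" in line
    ]


def unplug_succeeded(dmesg_text: str, lspci_text: str) -> bool:
    """True only if dmesg has no NVRM refusal and lspci shows no NVIDIA device."""
    return not busy_devices(dmesg_text) and not nvidia_devices(lspci_text)


if __name__ == "__main__":
    import subprocess

    # Run inside the guest after 'virsh detach-device' was issued on the host.
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    print("hot-unplug succeeded:", unplug_succeeded(dmesg, lspci))
```

Because the check reads the whole dmesg buffer, a stale NVRM message from an earlier attempt would also count as a failure. Clearing or timestamp-filtering the log before the detach is left out of this sketch.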
Actual results
The device cannot be removed.
[root@localhost ~]# [ 943.369462] pcieport 0000:00:01.7: pciehp: Slot(0-7): Button press: will power off in 5 sec
[ 948.412415] NVRM: Attempting to remove device 0000:08:00.0 with non-zero usage count!
[root@localhost ~]# lspci |grep 3D
08:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
[root@localhost ~]#
The 'reboot' command does not complete; the guest reports a hung task:
[ 1595.375611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1595.376673] task:reboot state:D stack:0 pid:24876 tgid:24876 ppid:1 task_flags:0x400100 flags:0x00000204
[ 1595.378017] Call trace:
[ 1595.378597] __switch_to+0xec/0x148
[ 1595.379249] __schedule+0x244/0x648
[ 1595.379897] schedule+0x3c/0xe0
[ 1595.380478] schedule_preempt_disabled+0x2c/0x50
[ 1595.381128] __mutex_lock.constprop.0+0x444/0x920
[ 1595.381749] __mutex_lock_slowpath+0x1c/0x30
[ 1595.382345] mutex_lock+0x6c/0x90
[ 1595.382846] device_shutdown+0xc0/0x238
[ 1595.383382] kernel_restart+0x48/0xb8
[ 1595.383907] __do_sys_reboot+0x178/0x278
[ 1595.384450] __arm64_sys_reboot+0x2c/0x40
[ 1595.384987] invoke_syscall.constprop.0+0x74/0xd0
[ 1595.385571] do_el0_svc+0xb0/0xe8
[ 1595.386052] el0_svc+0x44/0x1d0
[ 1595.386520] el0t_64_sync_handler+0x120/0x130
[ 1595.387076] el0t_64_sync+0x1a4/0x1a8
[ 1718.254457] INFO: task reboot:24876 blocked for more than 614 seconds.
[ 1718.256019] Tainted: G OE ------- --- 6.12.0-74.el10.aarch64+64k #1
Additional info:
The GPU device can be hot-unplugged successfully when no NVIDIA driver is present.
- blocks
-
RHEL-93084 [aarch64] virt-arm: NVIDIA GPU related research
-
- In Progress
-