
Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: rhel-10.1
Component/s: qemu-kvm / Devices
Labels: virt-arm
Regression: No
Severity: None
Epic Link: [aarch64] virt-arm: NVIDIA A100 GPUs research
AssignedTeam: rhel-virt-hwe-arm-1
Sub-System Group: ssg_virtualization
sprint_count: 1
Story Points: 0
Blocked: False
Ready: False
Blocked Reason: None
Product Documentation Required: None
Sprint: Split items
Preliminary Testing: None
Test Coverage: None
ProdDocsReview-CCS: Unspecified
ProdDocsReview-Dev: Unspecified
ProdDocsReview-QE: Unspecified
Experience:
Architecture: aarch64
PX Impact Score:
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Planning: None

What were you trying to do that didn't work?

Hot-unplugging a GPU device from a VM fails when the NVIDIA data center GPU driver is in use in the guest.

Please provide the package NVR for which the bug is seen:

Guest:
Kernel 6.12.0-74.el10.aarch64+64k
NVIDIA-SMI 570.133.20
Driver Version: 570.133.20

Host:
libvirt-10.10.0-8.1.el10_0.aarch64
qemu-kvm-9.1.0-15.el10_0.1.aarch64
Kernel 6.12.0-55.2.1.el10_0.aarch64+64k

How reproducible is this bug?:

100%

Steps to reproduce

Disable nouveau in the guest.
Start a VM with GPU passthrough (or hotplug a GPU to a running VM).
Install the driver in the guest (follow https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#nvidia-kernel-modules)
OR run:
1. dnf install -y gcc kernel*headers* kernel*devel*
2. dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
3. dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/sbsa/cuda-rhel9.repo
4. dnf -y module install nvidia-driver:open-dkms

Detach the device

# cat  ~/xml_files/hostdev_simple.xml 
<hostdev mode="subsystem" type="pci" managed="yes">
  <source>
    <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
  </source>
</hostdev>

# virsh detach-device avocado-vt-vm1 ~/xml_files/hostdev_simple.xml

Check the guest's dmesg and run 'lspci | grep NVIDIA' in the guest.

Run 'reboot' or 'fwupdmgr get-devices' in the guest.

Expected results

There is no error in the guest's dmesg.
The command 'lspci | grep NVIDIA' does not list the removed GPU device.
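
The two expected results above can be checked together with a small guest-side helper. This is only a sketch: the function name check_unplug is hypothetical, and the dmesg pattern is taken from the NVRM message shown under "Actual results", so other driver versions may word it differently.

```shell
# Hypothetical helper: judge a hot-unplug from guest-side output.
#   $1 = output of 'lspci' in the guest
#   $2 = output of 'dmesg' in the guest
# Prints PASS or a FAIL reason; returns non-zero on failure.
check_unplug() {
    # Expected result 2: no NVIDIA device is listed any more.
    case "$1" in
        *NVIDIA*)
            echo "FAIL: NVIDIA device still listed by lspci"
            return 1
            ;;
    esac
    # Expected result 1: the driver did not refuse the removal.
    case "$2" in
        *"NVRM: Attempting to remove device"*"non-zero usage count"*)
            echo "FAIL: NVRM reported non-zero usage count"
            return 1
            ;;
    esac
    echo "PASS"
}

# Usage inside the guest after 'virsh detach-device' on the host:
# check_unplug "$(lspci)" "$(dmesg)"
```

The helper only inspects text passed to it, so it can also be run against saved logs from a failed test.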

Actual results

The device cannot be removed.

[root@localhost ~]# [  943.369462] pcieport 0000:00:01.7: pciehp: Slot(0-7): Button press: will power off in 5 sec
[  948.412415] NVRM: Attempting to remove device 0000:08:00.0 with non-zero usage count!

[root@localhost ~]# lspci |grep 3D
08:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
[root@localhost ~]#

The reboot does not complete, and the guest logs the following hung-task error:

[ 1595.375611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1595.376673] task:reboot          state:D stack:0     pid:24876 tgid:24876 ppid:1      task_flags:0x400100 flags:0x00000204
[ 1595.378017] Call trace:
[ 1595.378597]  __switch_to+0xec/0x148
[ 1595.379249]  __schedule+0x244/0x648
[ 1595.379897]  schedule+0x3c/0xe0
[ 1595.380478]  schedule_preempt_disabled+0x2c/0x50
[ 1595.381128]  __mutex_lock.constprop.0+0x444/0x920
[ 1595.381749]  __mutex_lock_slowpath+0x1c/0x30
[ 1595.382345]  mutex_lock+0x6c/0x90
[ 1595.382846]  device_shutdown+0xc0/0x238
[ 1595.383382]  kernel_restart+0x48/0xb8
[ 1595.383907]  __do_sys_reboot+0x178/0x278
[ 1595.384450]  __arm64_sys_reboot+0x2c/0x40
[ 1595.384987]  invoke_syscall.constprop.0+0x74/0xd0
[ 1595.385571]  do_el0_svc+0xb0/0xe8
[ 1595.386052]  el0_svc+0x44/0x1d0
[ 1595.386520]  el0t_64_sync_handler+0x120/0x130
[ 1595.387076]  el0t_64_sync+0x1a4/0x1a8
[ 1718.254457] INFO: task reboot:24876 blocked for more than 614 seconds.
[ 1718.256019]       Tainted: G           OE     -------  ---  6.12.0-74.el10.aarch64+64k #1

Additional info:
The GPU device can be hot-unplugged successfully when no NVIDIA driver is present in the guest.
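
Since the unplug works without the NVIDIA driver, one thing to try is releasing the driver in the guest before detaching. The sketch below is an untested assumption, not a fix confirmed by this report: it assumes the non-zero usage count comes from open handles on the driver (for example the nvidia-persistenced service, if it is installed and running) and that the usual module names nvidia_uvm, nvidia_drm, nvidia_modeset and nvidia apply. The function name nvidia_release_plan is hypothetical, and it only prints commands.

```shell
# Hypothetical helper: print the guest-side commands that would release the
# NVIDIA driver before the host runs 'virsh detach-device'.
# Nothing is executed here; the plan is meant to be reviewed first.
nvidia_release_plan() {
    # Assumption: the persistence daemon, if running, keeps the device open.
    echo "systemctl stop nvidia-persistenced"
    # Unload dependent modules before the core module.
    # 'modprobe -r' refuses to remove a module that is still in use.
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        echo "modprobe -r $mod"
    done
}

# Print the plan, then run each line manually as root in the guest:
# nvidia_release_plan
```

If any modprobe -r step reports that the module is in use, something in the guest still holds the GPU, which would match the NVRM message above.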

blocks

RHEL-93084 [aarch64] virt-arm: NVIDIA GPU related research

Closed

Assignee: Donald Dutile (Inactive)
Reporter: Yingshun Cui
Developer: virt-maint
Votes: 0
Watchers: 11
Created: 2025/04/23 5:36 AM
Updated: 2025/09/26 11:57 AM
Resolved: 2025/08/28 10:20 PM
