Bug
Resolution: Unresolved
rhos-18.0.10 FR 3
rhos-workloads-compute
Moderate
The Nova compute agent incorrectly identifies a GPU passthrough device as dev_type='type-PF' in the resource tracker's final view, even when the configuration explicitly sets it to type-PCI using [pci]alias.
This occurs because the libvirt driver's hardware detection logic identifies the device as a Physical Function (PF) if it is SR-IOV-capable. This hardware detection takes precedence over the operator's configuration in nova.conf. The presence of the <capability type='virt_functions'> element in the device's XML description from libvirt is what triggers this behavior.
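The detection rule described above can be sketched as follows. This is a simplified illustration, not Nova's actual code: the function name `classify_dev_type` and the sample XML (including the device name and `maxCount` value) are assumptions made for the example, and real `virsh nodedev-dumpxml` output carries many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed node-device XML for an SR-IOV-capable device.
SAMPLE_XML = """
<device>
  <name>pci_0000_17_00_0</name>
  <capability type='pci'>
    <capability type='virt_functions' maxCount='16'/>
  </capability>
</device>
"""


def classify_dev_type(nodedev_xml: str) -> str:
    """Simplified sketch of the classification; not Nova's implementation."""
    root = ET.fromstring(nodedev_xml)
    pci_cap = root.find("capability[@type='pci']")
    if pci_cap is None:
        raise ValueError("not a PCI node device")
    # Collect the types of the nested function capabilities.
    sub_caps = {cap.get("type") for cap in pci_cap.findall("capability")}
    if "virt_functions" in sub_caps:
        # The device can spawn VFs, so it is treated as a Physical Function.
        return "type-PF"
    if "phys_function" in sub_caps:
        # The device points back at a parent PF, so it is a Virtual Function.
        return "type-VF"
    return "type-PCI"


print(classify_dev_type(SAMPLE_XML))  # type-PF
```

Nothing in this sketch consults nova.conf, which mirrors the behavior reported here: the classification is derived from the hardware description alone.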
To Reproduce
Steps to reproduce the behavior:
- Configure a compute node with a GPU that supports SR-IOV (e.g., an NVIDIA L4 or similar).
- In nova.conf on the compute node, configure PCI passthrough for this device using [pci]device_spec and [pci]alias. In the alias, explicitly set device_type to type-PCI.
- Restart the nova-compute service.
- Inspect nova-compute.log. The log shows that the configuration is loaded correctly, but the final resource view reported by the resource tracker shows dev_type='type-PF'.
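To make the last step easier to repeat, the log check can be scripted. The following is a minimal sketch, assuming the "Final resource view" lines have the shape shown in the log excerpt in this ticket; the helper name `dev_types_in_final_view` is hypothetical.

```python
import re

# Matches the dev_type tag as it appears in the PciDevicePool repr.
DEV_TYPE_RE = re.compile(r"dev_type='([^']+)'")


def dev_types_in_final_view(log_lines):
    """Return every dev_type value found in 'Final resource view' lines."""
    found = []
    for line in log_lines:
        if "Final resource view" in line:
            found.extend(DEV_TYPE_RE.findall(line))
    return found


# Sample line copied from the log excerpt in this ticket (elisions kept).
sample = [
    "DEBUG nova.compute.resource_tracker [...] Final resource view: ... "
    "pci_stats=[PciDevicePool(...,tags={...,dev_type='type-PF',...})]",
]
print(dev_types_in_final_view(sample))  # ['type-PF']
```

For a real check, pass an open file handle for nova-compute.log instead of the sample list; the location of that file depends on the deployment.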
Expected behavior
- The Nova compute agent should honor the device_type specified in the [pci]alias configuration, using it to override the hardware-detected device type. If a device is configured as type-PCI, the resource tracker should report it as such, regardless of its underlying SR-IOV capabilities.
- Alternatively, the documentation should at least clarify how this configuration value interacts with the autodetection logic: when the former must match the latter, and when it may differ. The upstream documentation currently states: "device_type: Type of PCI device. Valid values are: type-PCI, type-PF and type-VF. Note that "device_type": "type-PF" must be specified if you wish to passthrough a device that supports SR-IOV in its entirety."
That section could instead explain that users are not given a free choice between type-PCI and type-PF, because the value must match what libvirt and the Nova compute agent autodetect from the hardware. The same clarification applies to the downstream docs.
Screenshots
- Attached Image
Bug impact
- The incorrect dev_type prevents the Nova scheduler from correctly matching the device for instances that require a type-PCI device. This is particularly problematic for use cases like full GPU passthrough with non-GRID NVIDIA drivers, which cannot be loaded in the guest if the device is presented in SR-IOV mode.
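The mismatch can be illustrated with a small matching sketch. It is a simplified model rather than Nova's scheduler code: `alias_to_spec` and `pool_matches` are hypothetical helpers, and the vendor and product IDs on the reported pool are assumed to equal those in the alias, since the log excerpt elides them.

```python
import json

# Alias exactly as configured in nova.conf in this ticket.
ALIAS = json.loads(
    '{ "vendor_id":"10de", "product_id":"20f1", '
    '"device_type":"type-PCI", "name":"nvidia_a2" }'
)


def alias_to_spec(alias):
    """Hypothetical translation of an alias into a device request spec."""
    return {
        "vendor_id": alias["vendor_id"],
        "product_id": alias["product_id"],
        "dev_type": alias["device_type"],
    }


def pool_matches(pool, spec):
    """A pool satisfies the request only if every requested field is equal."""
    return all(pool.get(key) == value for key, value in spec.items())


# Pool as reported by the resource tracker (IDs assumed, dev_type from the log).
reported_pool = {"vendor_id": "10de", "product_id": "20f1", "dev_type": "type-PF"}
# Pool as the operator expected it to be reported.
expected_pool = dict(reported_pool, dev_type="type-PCI")

spec = alias_to_spec(ALIAS)
print(pool_matches(reported_pool, spec))  # False
print(pool_matches(expected_pool, spec))  # True
```

Under this model the only differing field is dev_type, which is enough to leave the request without a matching pool.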
Known workaround
- Disable SR-IOV for affected PCI devices in BIOS. Note that setting kernel args like pci=nosriov won't help with that problem.
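After changing the BIOS setting, it may help to confirm that the host no longer sees the device as SR-IOV-capable. The sketch below relies on the Linux kernel exposing a `sriov_totalvfs` sysfs attribute for SR-IOV-capable Physical Functions. Whether libvirt drops the virt_functions capability in exactly the same cases is an assumption, and both the PCI address and the helper name `advertises_sriov` are hypothetical.

```python
from pathlib import Path


def advertises_sriov(pci_address: str,
                     sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """Return True if the kernel exposes sriov_totalvfs for the device."""
    return (Path(sysfs_root) / pci_address / "sriov_totalvfs").exists()


# Hypothetical address; substitute the GPU's real domain:bus:slot.function.
print(advertises_sriov("0000:17:00.0"))
```

The `sysfs_root` parameter exists only so the check can be pointed at a different directory tree, for example when testing.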
Additional context
- Configuration:
[pci]
alias = { "vendor_id":"10de", "product_id":"20f1", "device_type":"type-PCI", "name":"nvidia_a2" }
- Logs:
The nova-compute.log shows the device reported incorrectly in the final resource view:
DEBUG nova.compute.resource_tracker [...] Final resource view: ... pci_stats=[PciDevicePool(...,tags={...,dev_type='type-PF',...})]