Red Hat OpenStack Services on OpenShift / OSPRH-20812

Nova compute agent reports dev_type as type-PF for GPUs despite type-PCI configuration


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • rhos-18.0.10 FR 3
    • openstack-nova
    • rhos-workloads-compute
    • Moderate

      The Nova compute agent incorrectly identifies a GPU passthrough device as dev_type='type-PF' in the resource tracker's final view, even when the configuration explicitly sets it to type-PCI using [pci]alias.

      This occurs because the libvirt driver's hardware detection logic identifies the device as a Physical Function (PF) if it is SR-IOV-capable. This hardware detection takes precedence over the operator's configuration in nova.conf. The presence of the <capability type='virt_functions'> element in the device's XML description from libvirt is what triggers this behavior.
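
      The detection described above can be illustrated with a minimal sketch. It is not Nova's actual code: the function name `detect_dev_type` and both XML samples are hypothetical, and the rule is simplified to "a `virt_functions` capability means type-PF, a `phys_function` capability means type-VF, anything else is type-PCI". Note that the operator's alias configuration plays no part in the decision.

```python
import xml.etree.ElementTree as ET

# Hypothetical libvirt node-device XML, trimmed to the relevant elements.
SRIOV_GPU_XML = """
<device>
  <name>pci_0000_41_00_0</name>
  <capability type='pci'>
    <vendor id='0x10de'/>
    <product id='0x20f1'/>
    <capability type='virt_functions' maxCount='16'/>
  </capability>
</device>
"""

# The same hypothetical device with no SR-IOV capability exposed.
PLAIN_GPU_XML = """
<device>
  <name>pci_0000_41_00_0</name>
  <capability type='pci'>
    <vendor id='0x10de'/>
    <product id='0x20f1'/>
  </capability>
</device>
"""


def detect_dev_type(nodedev_xml: str) -> str:
    """Classify a PCI device from its node-device XML (simplified rule)."""
    root = ET.fromstring(nodedev_xml)
    pci_cap = root.find("capability[@type='pci']")
    if pci_cap is None:
        raise ValueError("not a PCI device")
    # Collect the types of the capabilities nested under the PCI capability.
    sub_types = {cap.get("type") for cap in pci_cap.findall("capability")}
    if "virt_functions" in sub_types:
        return "type-PF"
    if "phys_function" in sub_types:
        return "type-VF"
    return "type-PCI"


print(detect_dev_type(SRIOV_GPU_XML))
print(detect_dev_type(PLAIN_GPU_XML))
```

      Under these assumptions, the only way to get type-PCI for such a device is for the `virt_functions` capability to disappear from the XML.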

      To Reproduce

      Steps to reproduce the behavior:

      1. Configure a compute node with a GPU that supports SR-IOV (e.g., an NVIDIA L4 or similar).
      2. In nova.conf on the compute node, configure PCI passthrough for this device using [pci]device_spec and [pci]alias. In the alias, explicitly set device_type to type-PCI.
      3. Restart the nova-compute service.
      4. Observe the nova-compute.log. The logs will show that the configuration is loaded correctly, but the final resource view reported by the resource tracker will show dev_type='type-PF'.

      Expected behavior

      • The Nova compute agent should honor the device_type specified in the [pci]alias configuration and use it to override the hardware-detected device type. If a device is configured as type-PCI, the resource tracker should report it as such, regardless of its underlying SR-IOV capabilities.
      • At a minimum, the documentation should explain how this configuration value interacts with the autodetection logic: when the configured value must match the detected one, and when it may differ. The upstream documentation currently states: "device_type
            Type of PCI device. Valid values are: type-PCI, type-PF and type-VF. Note that "device_type": "type-PF" must be specified if you wish to passthrough a device that supports SR-IOV in its entirety."

      That section could instead explain that users do not have a free choice between type-PCI and type-PF, because the value must match what libvirt and the Nova compute agent autodetect from the hardware. The same clarification could be made in the downstream documentation.

      Screenshots

      • Attached Image

      Bug impact

      • The incorrect dev_type prevents the Nova scheduler from correctly matching the device for instances that require a type-PCI device. This is particularly problematic for use cases like full GPU passthrough with non-GRID NVIDIA drivers, which cannot be loaded in the guest if the device is presented in SR-IOV mode.
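
      The matching failure can be sketched with a simplified model. This is an illustration under assumptions, not Nova's implementation: `alias_to_spec` and `matching_pools` are hypothetical helpers, and the pools are plain dictionaries standing in for the PciDevicePool entries seen in the log. The alias is the one from this report.

```python
import json

# Alias exactly as configured in nova.conf in this report.
ALIAS = json.loads(
    '{ "vendor_id":"10de", "product_id":"20f1", '
    '"device_type":"type-PCI", "name":"nvidia_a2" }'
)


def alias_to_spec(alias: dict) -> dict:
    """Turn an alias into the key/value pairs a pool must satisfy.

    Assumption: the alias key device_type is compared against the
    pool tag dev_type.
    """
    return {
        "vendor_id": alias["vendor_id"],
        "product_id": alias["product_id"],
        "dev_type": alias["device_type"],
    }


def matching_pools(pools: list, spec: dict) -> list:
    """Return the pools whose tags equal every key in the spec."""
    return [p for p in pools if all(p.get(k) == v for k, v in spec.items())]


# Pool as reported by the resource tracker for the SR-IOV-capable GPU.
reported = [{"vendor_id": "10de", "product_id": "20f1", "dev_type": "type-PF"}]
# Pool as the operator expected it to be reported.
expected = [{"vendor_id": "10de", "product_id": "20f1", "dev_type": "type-PCI"}]

spec = alias_to_spec(ALIAS)
print(len(matching_pools(reported, spec)))  # no pool satisfies the alias
print(len(matching_pools(expected, spec)))  # one pool satisfies the alias
```

      In this model the vendor and product IDs match in both cases; the request fails only because of the dev_type tag.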

      Known workaround

      • Disable SR-IOV for the affected PCI devices in the BIOS. Setting kernel arguments such as pci=nosriov does not resolve the problem.

      Additional context

        • Configuration:

      [pci]
      alias = { "vendor_id":"10de", "product_id":"20f1", "device_type":"type-PCI", "name":"nvidia_a2" }
       

        • Logs:
          The nova-compute.log will show the device being reported incorrectly in the final resource view:

      DEBUG nova.compute.resource_tracker [...] Final resource view: ... pci_stats=[PciDevicePool(...,tags={...,dev_type='type-PF',...})]
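
      To check a compute node for this symptom, the dev_type values can be pulled out of the "Final resource view" lines. The following is a small sketch, assuming the log format shown above; the helper name `reported_dev_types` and the sample line are hypothetical, and the sample keeps the same elisions as the excerpt in this report.

```python
import re

# Matches the dev_type tag as it appears in the PciDevicePool tags.
DEV_TYPE_RE = re.compile(r"dev_type='(type-(?:PCI|PF|VF))'")


def reported_dev_types(log_line: str) -> list:
    """Return every dev_type found in a 'Final resource view' log line."""
    if "Final resource view" not in log_line:
        return []
    return DEV_TYPE_RE.findall(log_line)


# Hypothetical sample modeled on the excerpt in this report.
SAMPLE = (
    "DEBUG nova.compute.resource_tracker [...] Final resource view: ... "
    "pci_stats=[PciDevicePool(...,tags={...,dev_type='type-PF',...})]"
)

print(reported_dev_types(SAMPLE))
```

      A result containing type-PF for a device whose alias sets type-PCI indicates the mismatch described in this report.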

              Assignee: Unassigned
              Reporter: Bohdan Dobrelia (bdobreli@redhat.com)
              rhos-workloads-compute