Bug
Resolution: Not a Bug
Major
rhos-17.1.z
Description of problem:
After deleting multiple guests that were using vGPU instances, the mdevs remain claimed and are not removed. As a result, placement reports fewer vGPU resources than the system should have available.
- Available instances on the host for one of the GPUs before creating any instances
[root@computesriov-0 heat-admin]# cat /sys/bus/pci/devices/0000\:04\:00.0/mdev_supported_types/nvidia-319/available_instances
4
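For reference, the same check can be swept across every GPU and supported type on the host; a minimal sketch relying only on the standard mdev sysfs layout (same paths as above):

for t in /sys/bus/pci/devices/*/mdev_supported_types/*; do
    printf '%s: %s\n' "$t" "$(cat "$t/available_instances")"
done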
- Placement reporting
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 35d7ddbc-a37c-42b6-9b6f-41832865f142
----------------------------------------------------------------------
resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
----------------------------------------------------------------------
VGPU | 1.0 | 1 | 4 | 0 | 1 | 4 |
----------------------------------------------------------------------
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 9ccde640-6755-451b-acb9-9b4dc2af643f
----------------------------------------------------------------------
resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
----------------------------------------------------------------------
VGPU | 1.0 | 1 | 4 | 0 | 1 | 4 |
----------------------------------------------------------------------
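Before the test run the providers should also show zero usage. Assuming the same osc-placement plugin used above, this can be cross-checked per provider with, e.g.:

openstack --os-placement-api-version 1.17 resource provider usage show 35d7ddbc-a37c-42b6-9b6f-41832865f142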
- Run whitebox to create, resize, and migrate vGPU-enabled instances
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ tempest run --serial --regex whitebox_tempest_plugin.api.compute.test_vgpu | tee vgpu_smoke_tests.log
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUColdMigration.test_revert_vgpu_cold_migration [59.047790s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUColdMigration.test_vgpu_cold_migration [35.478767s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUResizeInstance.test_standard_to_vgpu_resize [43.503475s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUResizeInstance.test_vgpu_to_standard_resize [37.093947s] ... ok
{0} whitebox_tempest_plugin.api.compute.test_vgpu.VGPUSanity.test_boot_instance_with_vgpu [19.356411s] ... ok
======
Totals
======
Ran: 5 tests in 233.1257 sec.
- Passed: 5
- Skipped: 0
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 194.4804 sec.
==============
Worker Balance
==============
- Worker 0 (5 tests) => 0:03:53.125749
- Confirm there are no more instances running
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack server list --all-projects
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$
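With no servers left, any mdev still registered on a compute host is leaked. A minimal sketch to list leftovers with their parent GPU and type, using only the standard mdev sysfs links (run on the compute host):

for m in /sys/bus/mdev/devices/*; do
    # resolve the device symlink to find the parent PCI address and the mdev type
    parent=$(basename "$(dirname "$(readlink -f "$m")")")
    type=$(basename "$(readlink "$m/mdev_type")")
    echo "$(basename "$m") parent=$parent type=$type"
done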
- Check placement reporting now
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 1e3e7132-83ce-40b8-80f1-0da01d12e067
----------------------------------------------------------------------
resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
----------------------------------------------------------------------
VGPU | 1.0 | 1 | 3 | 0 | 1 | 3 |
----------------------------------------------------------------------
- Check available instances on the GPU previously used by the instances
[heat-admin@computesriov-0 ~]$ sudo cat /sys/bus/pci/devices/0000\:04\:00.0/mdev_supported_types/nvidia-319/available_instances
3
- No issues when removing the leftover mdevs manually, from both the host and the container
[root@computesriov-0 0000:04:00.0]# ls
5a49f19e-5dde-49cb-8a37-12734913db95 class device i2c-0 local_cpulist msi_bus rescan resource1_wc sriov_offset subsystem_device
aer_dev_correctable config dma_mask_bits i2c-1 local_cpus msi_irqs reset resource3 sriov_stride subsystem_vendor
aer_dev_fatal consistent_dma_mask_bits driver iommu max_link_speed numa_node reset_method resource3_wc sriov_totalvfs uevent
aer_dev_nonfatal current_link_speed driver_override iommu_group max_link_width power resource revision sriov_vf_device vendor
ari_enabled current_link_width enable irq mdev_supported_types power_state resource0 sriov_drivers_autoprobe sriov_vf_total_msix
broken_parity_status d3cold_allowed firmware_node link modalias remove resource1 sriov_numvfs subsystem
[root@computesriov-0 heat-admin]# echo 1 > /sys/bus/mdev/devices/5a49f19e-5dde-49cb-8a37-12734913db95/remove
[root@computesriov-0 heat-admin]# ls /sys/bus/pci/devices/0000\:04\:00.0/
aer_dev_correctable class d3cold_allowed enable iommu_group max_link_speed msi_irqs rescan resource1 sriov_drivers_autoprobe sriov_vf_device uevent
aer_dev_fatal config device firmware_node irq max_link_width numa_node reset resource1_wc sriov_numvfs sriov_vf_total_msix vendor
aer_dev_nonfatal consistent_dma_mask_bits dma_mask_bits i2c-0 link mdev_supported_types power reset_method resource3 sriov_offset subsystem
ari_enabled current_link_speed driver i2c-1 local_cpulist modalias power_state resource resource3_wc sriov_stride subsystem_device
broken_parity_status current_link_width driver_override iommu local_cpus msi_bus remove resource0 revision sriov_totalvfs subsystem_vendor
[root@computesriov-0 heat-admin]# podman exec -it -u root nova_virtqemud /bin/bash
[root@computesriov-0 /]# ls /sys/bus/pci/devices/0000\:82\:00.0/
3514412c-0fbc-4c1d-bc36-434ceaeecfff ari_enabled current_link_speed driver i2c-3 local_cpulist modalias power_state resource resource3_wc sriov_stride subsystem_device
aer_dev_correctable broken_parity_status current_link_width driver_override iommu local_cpus msi_bus remove resource0 revision sriov_totalvfs subsystem_vendor
aer_dev_fatal class d3cold_allowed enable iommu_group max_link_speed msi_irqs rescan resource1 sriov_drivers_autoprobe sriov_vf_device uevent
aer_dev_nonfatal config device firmware_node irq max_link_width numa_node reset resource1_wc sriov_numvfs sriov_vf_total_msix vendor
af68c4f2-1630-4b3d-a5aa-88e25e72c047 consistent_dma_mask_bits dma_mask_bits i2c-2 link mdev_supported_types power reset_method resource3 sriov_offset subsystem
[root@computesriov-0 /]# echo 1 > /sys/bus/mdev/devices/af68c4f2-1630-4b3d-a5aa-88e25e72c047/remove
[root@computesriov-0 /]# echo 1 > /sys/bus/mdev/devices/3514412c-0fbc-4c1d-bc36-434ceaeecfff/remove
[root@computesriov-0 /]# ls sys/bus/pci/devices/0000\:82\:00.0/
aer_dev_correctable class d3cold_allowed enable iommu_group max_link_speed msi_irqs rescan resource1 sriov_drivers_autoprobe sriov_vf_device uevent
aer_dev_fatal config device firmware_node irq max_link_width numa_node reset resource1_wc sriov_numvfs sriov_vf_total_msix vendor
aer_dev_nonfatal consistent_dma_mask_bits dma_mask_bits i2c-2 link mdev_supported_types power reset_method resource3 sriov_offset subsystem
ari_enabled current_link_speed driver i2c-3 local_cpulist modalias power_state resource resource3_wc sriov_stride subsystem_device
broken_parity_status current_link_width driver_override iommu local_cpus msi_bus remove resource0 revision sriov_totalvfs subsystem_vendor
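The manual removal generalizes to a loop. A cautious sketch, assuming no running instance is still attached to any mdev on the host:

for m in /sys/bus/mdev/devices/*; do
    echo 1 > "$m/remove"    # destroys the mdev and frees its slot on the parent GPU
done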
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 1e3e7132-83ce-40b8-80f1-0da01d12e067
----------------------------------------------------------------------
resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
----------------------------------------------------------------------
VGPU | 1.0 | 1 | 4 | 0 | 1 | 4 |
----------------------------------------------------------------------
(.tempest) (overcloud) [stack@undercloud-0 tempest_workspace]$ openstack --os-placement-api-version 1.17 resource provider inventory list 9ccde640-6755-451b-acb9-9b4dc2af643f
----------------------------------------------------------------------
resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
----------------------------------------------------------------------
VGPU | 1.0 | 1 | 4 | 0 | 1 | 4 |
----------------------------------------------------------------------
Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220526.n.0
How reproducible:
100%
Steps to Reproduce:
1. Deploy a RHOS 17 environment that supports vGPU instances
2. Create one or two vGPU instances, perform several movement actions (resize, cold migration), then delete the instances; a minimal CLI sketch follows this list
3. Check available_instances in sysfs and the placement inventory on the compute host
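A minimal CLI sketch of step 2 (flavor, image, and network names here are placeholders; the resources:VGPU=1 extra spec is the standard way to request a vGPU):

# hypothetical flavor requesting one vGPU
openstack flavor create --ram 4096 --vcpus 2 --disk 20 --property resources:VGPU=1 vgpu.1
openstack server create --flavor vgpu.1 --image <image> --network <network> --wait vgpu-test
openstack server resize --flavor <second-vgpu-flavor> vgpu-test
openstack server resize confirm vgpu-test    # older clients: openstack server resize --confirm
openstack server delete --wait vgpu-test
# then re-check available_instances and the placement inventory as shown above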
Actual results:
The instance is deleted but its associated mdev remains claimed
Expected results:
All resources are correctly released and placement reports the correct availability for VGPU
Additional info:
Test bed can be made available if necessary