Spike
Resolution: Done
RHOSSTRAT-185 - NVME passthrough for cloud providers
Goal
Describe the existing Nova interfaces that can be used to implement an external service that handles the cleanup (wiping) of NVMe devices after VM deletion.
Assumptions
- NVMe disks are passed through to the VMs by passing through the NVMe controller PCI device from the hypervisor.
- RHOSO 18.0 is deployed and PCI passthrough is configured for the NVMe controller PCI devices.
- The PCI in Placement feature is enabled in RHOSO (in Tech Preview status as of 18.0-FR1).
- The external cleanup tool has access to the hypervisor and to the OpenStack APIs as admin.
- Only VM create and delete need to be handled for now; migrations, resize, evacuation, etc. are out of scope.
Suggested workaround
High-level steps:
- Detect the creation and deletion of VMs using NVMe device(s) via Nova notifications.
- Reserve the NVMe PCI device via the Placement API when a VM allocating the device is created.
- Wipe the NVMe device after the VM is deleted, then unreserve the PCI device via the Placement API.
Note that the doc below uses pure Nova config options for simplicity. These need to be translated to RHOSO 18.0 configuration.
Recap for PCI passthrough configuration
nova compute conf:
[pci]
device_spec = { "vendor_id":"2646", "product_id":"5013", "device_type": "type-PCI"}
alias = { "name": "nvme-type-1", "vendor_id":"2646", "product_id":"5013", "device_type": "type-PCI"}
nova api conf:
[pci]
alias = { "name": "nvme-type-1", "vendor_id":"2646", "product_id":"5013", "device_type": "type-PCI"}
nova flavor:
$ openstack flavor show m1.nvme1 | grep properties
| properties | pci_passthrough:alias='nvme-type-1:1' |
Recap enable PCI in Placement
Follow the documentation in https://docs.openstack.org/nova/latest/admin/pci-passthrough.html#pci-tracking-in-placement
nova compute conf:
[pci]
report_in_placement = True
nova api conf:
[filter_scheduler]
pci_in_placement = True
Detect VM creation and deletion via Nova notifications
Configure Nova to emit notifications to the message bus
nova compute conf:
[oslo_messaging_notifications]
driver = messagingv2
transport_url = <rabbitmq address>

[notifications]
# use "both" if other tools are also using the notifications and cannot use the new versioned format
notification_format = versioned
Listen on the message bus for notifications
Example Python code to connect and listen for notifications (the full demo is linked in the references):
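A minimal sketch using oslo.messaging is shown below. Assumptions: the RabbitMQ address placeholder has to be filled in, and Nova publishes versioned notifications on the "versioned_notifications" topic (the default).

import time

from oslo_config import cfg
import oslo_messaging


class NotificationEndpoint:
    # Only react to the two events the cleanup tool cares about
    filter_rule = oslo_messaging.NotificationFilter(
        event_type=r'instance\.(create|delete)\.end')

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        # payload carries the versioned InstanceActionPayload shown below
        print(event_type, payload['nova_object.data']['uuid'])


transport = oslo_messaging.get_notification_transport(
    cfg.CONF, url='rabbit://<rabbitmq address>')
listener = oslo_messaging.get_notification_listener(
    transport,
    [oslo_messaging.Target(topic='versioned_notifications')],
    [NotificationEndpoint()],
    executor='threading')
listener.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    listener.stop()
    listener.wait()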
- The instance.create.end notification is sent after the VM is scheduled to a compute host (hypervisor) and Nova has allocated the requested resources for the VM in the Placement API. After this Nova starts the VM on the compute host.
- The instance.delete.end notification is sent after the VM is stopped and deleted from the hypervisor, and Nova has removed the VM's resource allocation from the Placement API.
For example, an instance.delete.end notification will look like this:
{"message_id": "1323c4b9-bca0-453c-b700-c226769552d1", "publisher_id": "nova-compute:aio", "event_type": "instance.delete.end", "priority": "INFO", "payload": {"nova_object.name": "InstanceActionPayload", "nova_object.namespace": "nova", "nova_object.version": "1.8", "nova_object.data": {"fault": null, "request_id": "req-5847f682-e4dc-44ed-8473-71403424d114", "uuid": "a81880d0-e1f3-4195-8785-9078c899f69e", "user_id": "7e9f6361d07d41b8bd0d2a133c1d5d48", "tenant_id": "82cec4de18334e79b39916d53c3fdaab", "reservation_id": "r-9uplxpf3", "display_name": "vm1", "display_description": null, "host_name": "vm1", "host": "aio", "node": "aio", "os_type": null, "architecture": null, "availability_zone": "nova", "flavor": {"nova_object.name": "FlavorPayload", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"flavorid": "9e4b2a6d-5239-4269-a86d-febeb6400505", "memory_mb": 2048, "vcpus": 1, "root_gb": 4, "ephemeral_gb": 0, "name": "m1.nvme1", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"pci_passthrough:alias": "nvme-type-1:1"}, "projects": null, "description": null}}, "image_uuid": "505d3021-b162-4ca2-a83c-f86637de2d31", "key_name": null, "kernel_id": "", "ramdisk_id": "", "created_at": "2024-11-13T13:34:05Z", "launched_at": "2024-11-13T13:34:18Z", "terminated_at": "2024-11-13T14:34:45Z", "deleted_at": "2024-11-13T14:34:48Z", "updated_at": "2024-11-13T14:34:46Z", "state": "deleted", "power_state": "pending", "task_state": null, "progress": 0, "ip_addresses": [], "block_devices": [], "metadata": {}, "locked": false, "auto_disk_config": "MANUAL", "action_initiator_user": "7e9f6361d07d41b8bd0d2a133c1d5d48", "action_initiator_project": "82cec4de18334e79b39916d53c3fdaab", "locked_reason": null}}, "timestamp": "2024-11-13 14:34:49.430346"}
The interesting bits are:
- The FlavorPayload with the extra spec "pci_passthrough:alias": "nvme-type-1:1"; from this the external tool can detect whether the VM was using an NVMe device type (PCI alias).
- The field "node": "aio" tells the external tool which compute host the VM was running on.
- The field "uuid": "a81880d0-e1f3-4195-8785-9078c899f69e" tells which VM was deleted.
Reserve the PCI device in Placement
To prevent Nova from re-assigning a PCI device to the next VM before the cleanup can happen, the external tool needs to reserve the PCI resource in Placement.
- Based on the instance.create.end notification the external tool can detect whether the VM uses a flavor with a PCI alias that matches an NVMe PCI device.
- If so, the external tool can look up the allocations of the VM in Placement. The Placement consumer uuid is the VM uuid from the notification. Based on the resource class, the external tool can find which of the resource providers allocated for the VM represent NVMe PCI devices. The name of each such resource provider encodes both the hostname of the Nova compute host the VM is scheduled to and the PCI address of the NVMe device.
- On each of these resource providers the external tool needs to change the reserved value of the Placement resource inventory from 0 to 1 (see the sketch after this list).
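A hedged sketch of this flow against the Placement REST API using a keystoneauth1 admin session. The credentials, the Keystone URL, and the resource class name CUSTOM_PCI_2646_5013 (derived from the documented default CUSTOM_PCI_{vendor_id}_{product_id} naming) are assumptions; Placement microversion 1.26 or later is requested so that reserved may equal total.

from keystoneauth1 import loading, session

NVME_RC = 'CUSTOM_PCI_2646_5013'  # assumed name for vendor 2646 / product 5013
HEADERS = {'OpenStack-API-Version': 'placement 1.26'}
PLACEMENT = {'service_type': 'placement'}


def get_admin_session():
    # Assumed admin credentials; load them from the environment in practice
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://<keystone address>/v3',
        username='admin', password='<admin password>', project_name='admin',
        user_domain_name='Default', project_domain_name='Default')
    return session.Session(auth=auth)


def set_reserved(sess, rp_uuid, reserved):
    # Read-modify-write the inventory; the GET response carries
    # resource_provider_generation, which guards against concurrent updates
    url = '/resource_providers/%s/inventories/%s' % (rp_uuid, NVME_RC)
    inv = sess.get(url, endpoint_filter=PLACEMENT, headers=HEADERS).json()
    inv['reserved'] = reserved
    sess.put(url, json=inv, endpoint_filter=PLACEMENT, headers=HEADERS)


def reserve_nvme_devices_of_vm(sess, vm_uuid):
    # The Placement consumer uuid is the VM uuid from the notification
    allocs = sess.get('/allocations/%s' % vm_uuid, endpoint_filter=PLACEMENT,
                      headers=HEADERS).json()['allocations']
    nvme_rps = [rp for rp, a in allocs.items() if NVME_RC in a['resources']]
    for rp_uuid in nvme_rps:
        set_reserved(sess, rp_uuid, 1)
    return nvme_rps  # remember these until the VM is deleted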
Wipe the device at VM deletion and unreserve the device
The external tool can detect when a VM is deleted via the instance.delete.end notification. It can use its existing information about the devices reserved in Placement for this VM to know which devices on which host need to be cleaned.
After the tool has finished cleaning a device, it needs to go back to the Placement API and change the reserved value in the inventory from 1 to 0 to signal that the device can be assigned to the next VM, as sketched below.
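Continuing the sketch above, unreserving after a successful wipe is the same inventory update in reverse (wipe_nvme_device is a hypothetical hook standing in for the actual cleanup logic on the hypervisor):

def wipe_and_unreserve(sess, rp_uuids):
    for rp_uuid in rp_uuids:
        # hypothetical hook: run the actual wipe on the hypervisor first
        wipe_nvme_device(rp_uuid)
        set_reserved(sess, rp_uuid, 0)  # device can now go to the next VM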
Dependencies
- RHOSO 18.0.5 is needed for the workaround that supports many PCI devices with PCI in Placement: OSPRH-12962
- RHOSO 18.0-FR3 is planned to graduate PCI in Placement from Tech Preview to Supported: OSPRH-13106
- RHOSO 18.0-FR3 is planned to support configuring the Nova notification message bus via the standard OpenStackControlPlane CR interface: OSPRH-230
Aspects not covered
- how to deploy the external service on top of RHOSO 18.0
- the exact RHOSO 18.0 configuration procedure to enable PCI in Placement and notifications (only pure Nova config options are provided above)
References
- Configuring PCI passthrough in Nova: https://docs.openstack.org/nova/latest/admin/pci-passthrough.html
- Nova notification interface: https://docs.openstack.org/nova/latest/admin/notifications.html
- Example python code consuming Nova notifications: https://github.com/gibizer/nova-notification-demo/blob/3e81258032efab02a721ca3f694cbfc8cf70b143/ws_forwarder.py#L45-L64
- PCI in Placement: https://docs.openstack.org/nova/latest/admin/pci-passthrough.html#pci-tracking-in-placement
- Placement API reference: https://docs.openstack.org/api-ref/placement/
Issue links
- is blocked by OSPRH-230 - As a user I want to get notifications from the deployed nova cluster (Backlog)
- is blocked by OSPRH-13106 - Graduate Flavor based PCI in Placement feature to full support (In Progress)
- is blocked by OSPRH-12962 - Backport allocation candidate fixes to 18.0 (Closed)
- relates to OSPRH-14164 - Updating reserved value of an inventory created by Nova is undefined behavior (Backlog)