The back-end receives the guest memory as multiple shared memory regions, each annotated with (A) the starting physical guest address, and (B) the starting “user address”, i.e. the virtual address in qemu’s (the front-end’s) address space.
Without an IOMMU, the user address is only used for vring addresses; apart from that, the back-end generally works with physical guest addresses, deriving its own virtual addresses from those guest addresses.
With an IOMMU, physical guest addresses become completely unused. Instead, the back-end only receives I/O virtual addresses (IOVAs), which are translated to user addresses (in-qemu virtual addresses) via the IOMMU.
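Purely as an illustration of the per-region annotation described above (a made-up struct for this document, not the actual vhost-user wire format):

```rust
/// Illustrative only: the information each shared memory region is annotated
/// with, as described above (field names are made up for this sketch).
struct MemoryRegionAnnotation {
    /// (A) starting physical guest address of the region
    guest_phys_addr: u64,
    /// (B) starting "user address", i.e. the virtual address in the
    /// front-end's (qemu's) address space
    user_addr: u64,
    /// size of the region in bytes
    size: u64,
}
```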
For this, the back-end is supposed to have an IOTLB. The front-end, which has the IOMMU, can send updates and invalidations to it; but that will not suffice to translate everything, so the back-end needs to be prepared to declare IOTLB misses and request mappings from the front-end. This is done via the Backend Req mechanism (`VHOST_USER_PROTOCOL_F_BACKEND_REQ`). With `F_REPLY`, such requests can be forced to be synchronous, allowing synchronous IOTLB look-ups.
Basically everything (except access to the vrings) assumes that there are only two address types we have to deal with: physical guest addresses, and virtual addresses in our own (the back-end's) address space. The latter are assumed to be convertible to valid pointers to the data.
Basically all translation is done through a `GuestMemory` implementation (which is a trait): you pass it a physical guest address and a size, and you get a slice to the data in the back-end virtual address space.
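As a minimal sketch of that interface, assuming vm-memory's mmap-backed `GuestMemoryMmap` (addresses and sizes here are arbitrary):

```rust
use vm_memory::{Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};

fn main() {
    // One anonymous 64 KiB region starting at guest physical address 0; a real
    // back-end would construct this from the regions sent by the front-end.
    let mem = GuestMemoryMmap::<()>::from_ranges(&[(GuestAddress(0), 0x1_0000)])
        .expect("failed to create guest memory");

    // Pass a physical guest address and a size, get a slice into the
    // back-end's own virtual address space.
    let slice = mem.get_slice(GuestAddress(0x1000), 0x100).unwrap();
    assert_eq!(slice.len(), 0x100);

    // The Bytes impl builds typed reads/writes on top of the same translation.
    mem.write_obj(42u32, GuestAddress(0x1000)).unwrap();
    assert_eq!(mem.read_obj::<u32>(GuestAddress(0x1000)).unwrap(), 42);
}
```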
The vm-virtio crates are built on top of `GuestMemory`, so any IOVA translation layer must implement that interface, or we would need to make quite radical changes to those crates.
From a design perspective, there are two major downsides to implementing `GuestMemory` for an IOVA address space:

- All of these address types are interchangeable in practice (they are just `u64`, even though this is architecture-dependent), but if possible, it would be preferable to have a strict type separation between IOVAs, guest addresses, and user addresses. We won't be able to get any type separation if we have to implement `GuestMemory`, instead having to use the `GuestAddress` type for IOVAs (see the newtype sketch below).
- `GuestMemory` just has the wrong interface. It relies on being separable into several contiguous memory regions, which will no longer be true with an IOMMU: if `GuestAddress` is an IOVA, then the separate regions will not represent contiguous `GuestAddress` ranges (because they represent contiguous user address ranges).

The former is a nuisance, the latter is quite bad, and it is not yet clear how much of a problem it represents.
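To illustrate what kind of type separation is meant by the first point, hypothetical newtypes could look like this (the names are made up; today all of these roles are filled by `GuestAddress`, which just wraps a `u64`):

```rust
/// Hypothetical strict address types; currently, all of these end up as
/// vm-memory's GuestAddress (a plain u64 wrapper), so the compiler cannot
/// catch mixing them up.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Iova(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GuestPhysAddr(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct UserAddr(u64);
```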
It seems likely that the separability is not relevant to `GuestMemory` users outside of the vm-memory crate. Instead, it may only be used by vm-memory itself to automatically implement the `GuestMemory` trait methods.
If so, either of two courses of action seems feasible:

- Implement `GuestMemory` on an IOMMU-aware object, but do not provide the separating methods, e.g. `find_region()`. Such methods would panic at runtime. We would not use the default `GuestMemory` method implementations, but reimplement them so they are IOMMU-aware. This is quite ugly, but might provide the least invasive solution.
- Remove the separability from the `GuestMemory` trait, instead moving it to a new trait (e.g. `PhysicalGuestMemory`), on which `GuestMemory` is then auto-implemented. This would be a breaking change in the interface, but would statically prove that the separability is not used outside of the vm-memory crate. (A rough sketch of this option follows below.)

I assume the latter is better for immediate development, and should be the first method to pursue. If the breaking change is not palatable upstream, it should be easily transformable into the first.
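A rough sketch of the second option; apart from `find_region()`, the trait and method names here are assumptions and deliberately simplified stand-ins for the real vm-memory traits:

```rust
use vm_memory::{GuestAddress, GuestMemoryRegion, MemoryRegionAddress};

/// Hypothetical trait capturing only the "separable into regions" part that
/// would be split out of GuestMemory.
pub trait PhysicalGuestMemory {
    type R: GuestMemoryRegion;

    /// Find the region containing `addr`, if any.
    fn find_region(&self, addr: GuestAddress) -> Option<&Self::R>;
}

/// Simplified stand-in for the remaining (non-separability) part of
/// GuestMemory, auto-implemented for everything that is separable.
pub trait AccessibleMemory {
    /// Translate `addr` and return a host pointer to `count` bytes.
    fn get_host_address(&self, addr: GuestAddress, count: usize) -> Option<*mut u8>;
}

impl<T: PhysicalGuestMemory> AccessibleMemory for T {
    fn get_host_address(&self, addr: GuestAddress, count: usize) -> Option<*mut u8> {
        // Auto-implementation in terms of region separability. An IOMMU-aware
        // memory object would skip PhysicalGuestMemory and implement
        // AccessibleMemory (i.e. the remaining GuestMemory interface) directly.
        let region = self.find_region(addr)?;
        let offset = addr.0.checked_sub(region.start_addr().0)?;
        if offset.checked_add(count as u64)? > region.len() {
            return None;
        }
        region.get_host_address(MemoryRegionAddress(offset)).ok()
    }
}
```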
We need a trait definition for something that can translate IOVAs to user addresses, i.e. an IOMMU.

We need an IOMMU-aware `GuestMemory` wrapper such that:

- externally, it acts as a `GuestMemory` (`GuestAddress` values fed into it are assumed to be IOVAs),
- internally, it translates those IOVAs via the IOMMU trait and forwards the access to an inner object implementing the `GuestMemory` trait,
- that inner `GuestMemory` object allows access based on user addresses (i.e. `GuestAddress` values fed into it are assumed to be user addresses). This must be minded when constructing this inner object, i.e. its regions must be based on the user addresses provided by the front-end, not the guest addresses.

We need an implementation of this IOMMU trait satisfying the vhost-user model: an IOTLB that receives updates and invalidations from the front-end, and that requests mappings from the front-end when it encounters misses. Note that this IOTLB would thus need to be aware of vhost-user concepts. (A rough sketch of the trait and wrapper follows below.)
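A very rough sketch of the first two pieces (all names are assumptions; the inner memory is any `GuestMemory` object whose regions are constructed from user addresses):

```rust
use vm_memory::{GuestAddress, GuestMemory, GuestMemoryError};

/// Hypothetical trait for something that can translate IOVAs to user
/// addresses, i.e. an IOMMU. The vhost-user implementation would be backed
/// by an IOTLB plus miss requests to the front-end.
pub trait Iommu {
    /// Translate `len` bytes starting at `iova` into a user address, or fail
    /// (e.g. because the range is unmapped).
    fn translate(&self, iova: u64, len: u64) -> Option<GuestAddress>;
}

/// Hypothetical IOMMU-aware wrapper: externally, GuestAddress values are
/// IOVAs; internally, they are translated and forwarded to an inner
/// GuestMemory whose regions are based on user addresses.
pub struct IommuMemory<M: GuestMemory, I: Iommu> {
    inner: M,
    iommu: I,
}

impl<M: GuestMemory, I: Iommu> IommuMemory<M, I> {
    pub fn new(inner: M, iommu: I) -> Self {
        IommuMemory { inner, iommu }
    }

    /// Sketch of a single translated access; a full version would implement
    /// the whole GuestMemory (or successor) trait this way.
    pub fn get_host_address(
        &self,
        iova: GuestAddress,
        count: usize,
    ) -> Result<*mut u8, GuestMemoryError> {
        let user_addr = self
            .iommu
            .translate(iova.0, count as u64)
            .ok_or(GuestMemoryError::InvalidGuestAddress(iova))?;
        // The inner object interprets GuestAddress values as user addresses.
        if !self.inner.check_range(user_addr, count) {
            return Err(GuestMemoryError::InvalidGuestAddress(iova));
        }
        self.inner.get_host_address(user_addr)
    }
}
```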
The back-end implementation would be tasked with connecting the IOTLB to the vhost-user back-end req FD as provided by `set_backend_req_fd()`.
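A sketch of that wiring, assuming the `set_backend_req_fd()` hook hands us a `vhost::vhost_user::Backend`; the `VhostUserIommu` type and its fields are made up for illustration:

```rust
use std::sync::{Arc, Mutex};
use vhost::vhost_user::Backend;

/// Hypothetical vhost-user IOMMU/IOTLB object, shared between the device
/// implementation and the IOMMU-aware memory wrapper.
#[derive(Default)]
pub struct VhostUserIommu {
    /// Channel to the front-end for IOTLB miss requests; populated once the
    /// front-end has sent the back-end request FD.
    backend_req: Mutex<Option<Backend>>,
}

impl VhostUserIommu {
    /// Called from the back-end's set_backend_req_fd() implementation.
    pub fn set_backend_req(&self, backend: Backend) {
        *self.backend_req.lock().unwrap() = Some(backend);
    }
}

pub struct MyDeviceBackend {
    iommu: Arc<VhostUserIommu>,
    // ... device state ...
}

impl MyDeviceBackend {
    /// This is what the set_backend_req_fd() trait method would do (the rest
    /// of the back-end trait implementation is omitted here).
    pub fn set_backend_req_fd(&self, backend: Backend) {
        self.iommu.set_backend_req(backend);
    }
}
```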
We can either implement these things across the existing crates (e.g. the IOMMU-aware `GuestMemory` wrapper would then be provided by the vm-memory crate); alternatively, I believe we could also create a new independent crate that contains all of those, so none of the rust-vmm crates would need to be modified.
It is very much desirable to get these changes into the rust-vmm crates, though, so that is what we should pursue first.
The IOTLB implementation will require access to the back-end req channel (which is a Unix domain socket connection). This could interfere with concurrent use for other purposes, especially when relying on synchronous communication (`F_REPLY` IOTLB messages).

Luckily, `vhost::vhost_user::Backend` has internal mutability and can simply be cloned to allow for multiple concurrent users, so it shouldn't pose a problem.
An IOTLB miss will trigger a vhost-user back-end request, i.e. the back-end has to send a message to the front-end over the back-end request channel, wait for the front-end to reply with the requested mapping, and only then retry the translation.
This is quite slow. A fully synchronous model, i.e. where we have a synchronous function to read from guest memory, will suffer a lot in performance when encountering IOTLB misses.
An asynchronous model would be much better here: accessing guest memory would be allowed to yield, so another guest request could be processed while we await the front-end's answer. However, allowing async memory accesses would likely require a major redesign of a lot of vhost components.
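Just to illustrate what this would mean at the interface level, an async variant of the hypothetical translation trait might look as follows (this is not existing vhost or vm-memory API):

```rust
/// Hypothetical async IOMMU interface: on an IOTLB miss, translate() can
/// await the front-end's reply instead of blocking the worker thread, so
/// other guest requests can be processed in the meantime.
pub trait AsyncIommu {
    fn translate(
        &self,
        iova: u64,
        len: u64,
    ) -> impl std::future::Future<Output = Option<u64>> + Send;
}
```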