RHEL-71813

netkvm: evaluate merging header and data to optimize host throughput


    • Important
    • 1
    • rhel-virt-windows
    • ssg_virtualization
    • 8
    • QE ack
    • False
    • False
    • Yes
    • Red Hat Enterprise Linux
    • Virtio-win 02/Apr- 15/Apr
    • Pass
    • None
    • Enhancement
    • Proposed
    • x86_64
    • Windows
    • None

      Goal

      Evaluate the feasibility of merging the virtio header and data packet into the same memory block to reduce DMA operations, optimize PCIe bandwidth usage, and improve overall network throughput.

      Current State

      In netkvm's current receive logic, the virtio protocol header and the packet data are placed in two separate memory blocks, so at least two memory blocks are needed for each receive buffer. From the host's perspective, a network card that implements virtio-net in hardware needs two DMA operations per packet, which consumes more PCIe bandwidth.
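
      The difference between the two layouts can be sketched as a small model. This is only an illustration, not driver code: the 12-byte header length is an assumed value (the real length depends on the negotiated virtio-net features), the function names are hypothetical, and the model assumes the device issues one DMA operation per contiguous memory block.

```python
from typing import List, Tuple

# Assumed sizes, for illustration only.
HEADER_LEN = 12    # hypothetical virtio-net header length
PACKET_LEN = 1400  # packet size used in the hardware report in this ticket


def rx_blocks(header_len: int, data_len: int, merged: bool) -> List[Tuple[str, int]]:
    """Return the memory blocks that back one received packet."""
    if merged:
        # Header and data share one contiguous block.
        return [("header+data", header_len + data_len)]
    # Header and data live in two separate blocks.
    return [("header", header_len), ("data", data_len)]


def dma_operations(blocks: List[Tuple[str, int]]) -> int:
    """Assumption: the device needs one DMA operation per contiguous block."""
    return len(blocks)


separate_layout = rx_blocks(HEADER_LEN, PACKET_LEN, merged=False)
merged_layout = rx_blocks(HEADER_LEN, PACKET_LEN, merged=True)

print("separate:", dma_operations(separate_layout), "DMA operations")
print("merged:  ", dma_operations(merged_layout), "DMA operations")
```

      Under these assumptions the merged layout halves the number of DMA operations per packet while moving the same number of bytes.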

      Upstream issue

      netkvm: Enhancing Host Throughput by Combining Virtio Header and Data in a Single Memory Block for NetKVM #1078

      Implementation

      On the RX path, the driver now prefers to place the virtio header and the network packet data in the same memory block whenever such a merge is possible. In addition, there is a new configuration parameter, "Init.SeparateRxTail", which is set to "Enable" by default.

      It works as follows:

      Typically the driver needs input buffers of [64K + 30 bytes].

      • If "SeparateTail" is enabled, the driver tries to allocate buffers of 64K, and the 30-byte tail is allocated separately in a smaller memory block.
      • If "SeparateTail" is disabled, the driver tries to allocate one contiguous buffer of [64K + 30 bytes]. The last 4K page is then almost entirely empty and in most cases goes unused, so for a queue of 256 entries the system ends up with ~1M of valuable hardware memory that cannot be used efficiently. In exchange, a hardware virtio-net implementation spends less time on RX DMA transactions, because the entire RX buffer is contiguous and is described by a single virtio-net descriptor.
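
      The ~1M figure can be checked with a short calculation. This is a sketch that assumes contiguous buffers are allocated in whole 4K pages and ignores any allocator overhead for the small separate tail blocks. The helper names are hypothetical.

```python
PAGE_SIZE = 4096          # 4K pages, as in the description above
MAIN_LEN = 64 * 1024      # 64K main part of the RX buffer
TAIL_LEN = 30             # 30-byte tail
QUEUE_ENTRIES = 256       # queue size used in the description above


def pages_needed(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Round a byte count up to whole pages."""
    return -(-nbytes // page_size)


def unused_bytes_per_buffer(separate_tail: bool) -> int:
    """Bytes left unused in the page-granular allocation of one RX buffer."""
    if separate_tail:
        # 64K fills exactly 16 pages; the tail comes from a separate small block
        # (its allocator overhead is ignored in this sketch).
        return pages_needed(MAIN_LEN) * PAGE_SIZE - MAIN_LEN
    # One contiguous allocation of 64K + 30 bytes spills into a 17th page.
    total = MAIN_LEN + TAIL_LEN
    return pages_needed(total) * PAGE_SIZE - total


for separate in (True, False):
    per_buffer = unused_bytes_per_buffer(separate)
    per_queue = per_buffer * QUEUE_ENTRIES
    print(f"separate_tail={separate}: {per_buffer} B per buffer, "
          f"{per_queue / (1024 * 1024):.2f} MiB per queue")
```

      With the tail kept contiguous, each buffer occupies 17 pages instead of 16, and nearly all of the extra page is unused. Over 256 entries this adds up to roughly 1 MiB per queue.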

      For the QEMU virtio-net device: in general, the change may yield a very small improvement in RX CPU consumption, and SeparateTail=Disable may yield an additional very small improvement when receiving small packets. This still needs to be checked.

      For hardware: an AliCloud engineer reported a 5-10% improvement when receiving UDP packets of 1400 bytes.

              ybendito@redhat.com Yuri Benditovich
              rh-ee-wji Wenkang Ji
              Meirav Dean Meirav Dean
              Wenkang Ji Wenkang Ji