RHEL-22865

glibc: Memcpy throughput lower on RH8.4 compared to RH7.5 - same Skylake hardware


    • Bug
    • Resolution: Done-Errata
    • Blocker
    • rhel-8.8.0.z
    • glibc
    • glibc-2.28-225.el8_8.9
    • Important
    • ZStream
    • rhel-pt-c-libs
    • ssg_platform_tools
    • Yes
    • Red Hat Enterprise Linux
    • Enhancement
      .Improved string and memory routine performance on Intel® Xeon® v5-based hardware in `glibc`

      Previously, the default amount of cache used by `glibc` for string and memory routines resulted in lower than expected performance on Intel® Xeon® v5-based systems. With this update, the amount of cache to use has been tuned to improve performance.
    • Proposed
    • x86_64

      [clone of RHELPLAN-152599]

      Description of problem:
      A customer reported a memcpy performance regression from RHEL 7 to RHEL 8 on Intel Skylake hardware.

      Version-Release number of selected component (if applicable):

      How reproducible:
      The customer used the following example to demonstrate the problem.

      # perf bench mem memcpy -f default --nr_loops 500 --size 3MB

      That test achieved 8.5 GB/sec on RHEL-7.5, and only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.

      Steps to Reproduce:
      Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer had a 2-socket Skylake server. I have been able to reproduce this on a 2-socket Cascade Lake server.

      Additional info:
      Thanks to great triaging help from Carlos O'Donell, the problem is understood.
      It turns out glibc is selecting a sub-optimal memcpy routine for that processor.

      On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then.

      On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".

      On RHEL-8.4, if the "Prefer_ERMS" attribute is given to glibc, then the faster "__memmove_erms()" is used.

      For example, slow and fast cases:

      # perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
        5.468937 GB/sec
      # GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
        > perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
        12.508272 GB/sec
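
      For context, glibc binds memcpy to one of these implementations once at program startup, using the ELF IFUNC mechanism: a resolver function inspects the CPU's features (and tunables such as Prefer_ERMS) and returns a pointer to the variant to use. Below is a minimal, self-contained sketch of that mechanism; the impl_* functions and the AVX check are illustrative stand-ins, not glibc's actual resolver logic.

      /* ifunc_demo.c - build with: gcc -O ifunc_demo.c -o ifunc_demo
       * (GCC on an x86_64 ELF target) */
      #include <stdio.h>

      /* Stand-in implementations; in glibc the candidates here would be
       * __memmove_avx_unaligned_erms(), __memmove_erms(), and so on. */
      static int impl_generic(void) { return 1; }
      static int impl_avx(void)     { return 2; }

      /* IFUNC resolver: runs once, before main(), and returns the variant
       * the dynamic linker should bind to the pick() symbol. */
      static int (*resolve_pick(void))(void)
      {
          __builtin_cpu_init();
          return __builtin_cpu_supports("avx") ? impl_avx : impl_generic;
      }

      int pick(void) __attribute__((ifunc("resolve_pick")));

      int main(void)
      {
          printf("selected variant: %d\n", pick());
          return 0;
      }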

      I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below; a sketch of a comparable harness follows the listing:

      # gcc -O memcpy.c -o memcpy
      # ./memcpy --help
        USAGE: ./memcpy size-in-MB loop-iterations
      # ./memcpy 3 500
        Rate for 500 3MB memcpy iterations: 7.30 GB/sec
      # GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
        Rate for 500 3MB memcpy iterations: 27.29 GB/sec
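
      The attached memcpy.c itself is not reproduced in this report. For reference, a minimal sketch of a comparable timing harness follows; its structure, names, and output format are assumptions modeled on the usage shown above, not the actual attachment.

      /* memcpy_sketch.c - build with: gcc -O memcpy_sketch.c -o memcpy_sketch */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      int main(int argc, char **argv)
      {
          if (argc != 3) {
              fprintf(stderr, "USAGE: %s size-in-MB loop-iterations\n", argv[0]);
              return 1;
          }
          size_t mb = strtoul(argv[1], NULL, 10);
          unsigned long loops = strtoul(argv[2], NULL, 10);
          size_t len = mb * 1024 * 1024;

          char *src = malloc(len);
          char *dst = malloc(len);
          if (!src || !dst)
              return 1;
          memset(src, 0x5a, len);  /* touch pages so faults don't skew timing */
          memset(dst, 0, len);

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (unsigned long i = 0; i < loops; i++)
              memcpy(dst, src, len);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("Rate for %lu %zuMB memcpy iterations: %.2f GB/sec\n",
                 loops, mb, (double)len * loops / 1e9 / secs);
          return 0;
      }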

      The customer's system was booted with mitigations=off and with transparent hugepages (THP) disabled. Neither is needed to reproduce this problem, but disabling THP does allow the simple memcpy reproducer to achieve much higher performance.
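
      To check the THP effect without rebooting or changing the system-wide setting, madvise(2) can opt a single buffer out of transparent hugepages. The sketch below is illustrative only; the 3 MB size mirrors the test above, and wiring it into the reproducer is left as an assumption.

      /* nothp_sketch.c - opt one buffer out of THP via MADV_NOHUGEPAGE */
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>

      int main(void)
      {
          size_t len = 3 * 1024 * 1024;

          /* mmap so we control page attributes for the whole region */
          char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) {
              perror("mmap");
              return 1;
          }

          /* Disable THP for just this region, approximating a boot-time
           * transparent_hugepage=never for this one buffer. */
          if (madvise(buf, len, MADV_NOHUGEPAGE) != 0)
              perror("madvise(MADV_NOHUGEPAGE)");

          memset(buf, 0, len);  /* fault pages in as 4 KiB pages */
          printf("buffer at %p mapped without transparent hugepages\n",
                 (void *)buf);
          munmap(buf, len);
          return 0;
      }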

        rhn-engineering-dj DJ Delorie
        Martin Coufal
        Votes: 0
        Watchers: 10