  1. RHEL
  2. RHEL-25530

glibc: Backport AMD ERMS performance fix (swbz#30994) [rhel-10]

    • glibc-2.39-7.el10
    • None
    • None
    • Patch
    • 1
    • rhel-sst-pt-libraries
    • ssg_platform_tools
    • 13
    • 3
    • Yes
    • Red Hat Enterprise Linux
    • SST PT Libraries Sprint 4
    • Enhancement
      .Optimization of AMD Zen 3 and Zen 4 performance in `glibc`

      Previously, AMD Zen 3 and Zen 4 processors sometimes used the Enhanced Repeat Move String (ERMS) version of the `memcpy` and `memmove` library routines even when it was not the optimal choice. With this update to `glibc`, AMD Zen 3 and Zen 4 processors use the optimal versions of `memcpy` and `memmove`.
    • Done
    • x86_64
    • None

      Upstream fixed a string function performance issue on certain AMD CPUs.

      commit 491e55beab7457ed310a4a47496f4a333c5d1032
      Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
      Date:   Thu Feb 8 10:08:40 2024 -0300
      
          x86: Expand the comment on when REP STOSB is used on memset
          
          Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
      
      commit 272708884cb750f12f5c74a00e6620c19dc6d567
      Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
      Date:   Thu Feb 8 10:08:39 2024 -0300
      
          x86: Do not prefer ERMS for memset on Zen3+
          
          For AMD Zen3+ architecture, the performance of the vectorized loop is
          slightly better than ERMS.
          
          Checked on x86_64-linux-gnu on Zen3.
          Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
      
      commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e
      Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
      Date:   Thu Feb 8 10:08:38 2024 -0300
      
          x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
          
          The REP MOVSB usage on memcpy/memmove does not show much performance
          improvement on Zen3/Zen4 cores compared to the vectorized loops.  Also,
          as from BZ 30994, if the source is aligned and the destination is not
          the performance can be 20x slower.
          
          The performance difference is noticeable with small buffer sizes, closer
          to the lower bounds limits when memcpy/memmove starts to use ERMS.  The
          performance of REP MOVSB is similar to vectorized instruction on the
          size limit (the L2 cache).  Also, there is no drawback to multiple cores
          sharing the cache.
          
          Checked on x86_64-linux-gnu on Zen3.
          Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
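
      The aligned-source, unaligned-destination case described in the commit message for BZ 30994 can be probed with a small timing helper. The following is a minimal sketch, not the upstream benchmark: the helper name `time_copy_ns`, the 64-byte alignment, and the GCC/Clang-style compiler barrier are assumptions made for illustration only.

      ```c
      #define _POSIX_C_SOURCE 200809L
      #include <stddef.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      /* Hypothetical helper (not part of glibc): returns the average time in
         nanoseconds of one memcpy call of `len` bytes.  The source buffer is
         64-byte aligned; the destination starts `dst_offset` bytes (0..63)
         past a 64-byte boundary.  Returns -1.0 on invalid arguments,
         allocation failure, or if the copied data does not match. */
      static double time_copy_ns(size_t len, size_t dst_offset, unsigned iters)
      {
          if (iters == 0 || dst_offset >= 64)
              return -1.0;

          /* aligned_alloc requires a size that is a multiple of the alignment. */
          size_t padded = (len + 64 + 63) / 64 * 64;
          unsigned char *src = aligned_alloc(64, padded);
          unsigned char *dst = aligned_alloc(64, padded);
          if (src == NULL || dst == NULL) {
              free(src);
              free(dst);
              return -1.0;
          }
          memset(src, 0xA5, padded);
          memset(dst, 0, padded);

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (unsigned i = 0; i < iters; i++) {
              memcpy(dst + dst_offset, src, len);
              /* Compiler barrier (GCC/Clang extension) so the copy is kept. */
              __asm__ volatile("" : : "r"(dst) : "memory");
          }
          clock_gettime(CLOCK_MONOTONIC, &t1);

          /* Sanity check: the destination must now hold the source bytes. */
          int ok = memcmp(dst + dst_offset, src, len) == 0;
          free(src);
          free(dst);
          if (!ok)
              return -1.0;

          double elapsed = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                         + (double)(t1.tv_nsec - t0.tv_nsec);
          return elapsed / (double)iters;
      }
      ```

      Calling `time_copy_ns(len, 0, iters)` and `time_copy_ns(len, 1, iters)` for a range of `len` values and comparing the two results gives a rough view of the alignment effect. The numbers depend on the CPU, the glibc build, and system load, so any ratio observed this way is indicative only.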
      

              xmcoufal Martin Coufal
              fweimer@redhat.com Florian Weimer
              Platform Tools - Libraries Bot Platform Tools - Libraries Bot
              Martin Coufal Martin Coufal
              Tomas Capek Tomas Capek