- Bug
- Resolution: Done-Errata
- Blocker
- rhel-8.8.0.z
- None
- glibc-2.28-225.el8_8.9
- None
- Important
- ZStream
- rhel-pt-c-libs
- ssg_platform_tools
- 1
- False
- False
- Yes
- Red Hat Enterprise Linux
- None
- Enhancement
- Proposed
- x86_64
- None
[clone of RHELPLAN-152599]
Description of problem:
Customer reported a memcpy performance regression from RHEL 7 to RHEL 8 on Intel Skylake.
Version-Release number of selected component (if applicable):
How reproducible:
The customer used the following example to demonstrate the problem.
# perf bench mem memcpy -f default --nr_loops 500 --size 3MB
That test achieved 8.5 GB/sec on RHEL-7.5, and only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.
Steps to Reproduce:
Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer had a 2-socket Skylake server. I have been able to reproduce this on a 2-socket Cascade Lake server.
Additional info:
Thanks to great triaging help from Carlos O'Donell, the problem is understood.
It turns out glibc is selecting a sub-optimal memcpy routine for that processor.
On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then.
On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".
On RHEL-8.4, if "Prefer_ERMS" is passed to glibc through the glibc.cpu.hwcaps tunable, then the faster "__memmove_erms()" is used.
For example, slow and fast cases:
# perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
5.468937 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
> perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
12.508272 GB/sec
I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below:
# gcc -O memcpy.c -o memcpy
# ./memcpy --help
USAGE: ./memcpy size-in-MB loop-iterations
# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 7.30 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.29 GB/sec
The customer's system was booted with mitigations=off and with transparent hugepages (THP) disabled. Neither is needed to reproduce this problem, but disabling THP does let the simple memcpy reproducer achieve much higher performance.
Links to:
RHBA-2024:127829 glibc update