Type: Bug
Resolution: Done-Errata
Priority: Blocker
Fix Version/s: rhel-8.8.0.z
Affects Version/s: rhel-8.8.0.z
Component/s: glibc
Labels:
None

Fixed in Build:
glibc-2.28-225.el8_8.9
Regression:
None
Severity:
Important
Keywords:

ZStream

AssignedTeam:
rhel-pt-c-libs
Sub-System Group:

ssg_platform_tools

Story Points:
1
Blocked:
False
Ready:
False
Blocked Reason:

None

Product Documentation Required:
Yes
Products:

Red Hat Enterprise Linux
Sprint:
None

Preliminary Testing:
Pass
Errata Link:
https://errata.engineering.redhat.com/advisory/127829
Test Coverage:
None

Release Note Type:
Enhancement
Release Note Text:

.Improved string and memory routine performance on Intel® Xeon® v5-based hardware in `glibc`

Previously, the default amount of cache used by `glibc` for string and memory routines resulted in lower than expected performance on Intel® Xeon® v5-based systems. With this update, the amount of cache to use has been tuned to improve performance.

Release Note Status:
Proposed

Experience:
Architecture:

x86_64

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Planning:
None

[clone of RHELPLAN-152599]

Description of problem:
Customer reported a performance regression from RHEL 7 to RHEL 8 on Intel Skylake.

Version-Release number of selected component (if applicable):

How reproducible:
The customer used the following example to demonstrate the problem.

perf bench mem memcpy -f default --nr_loops 500 --size 3MB

That test achieved 8.5 GB/sec on RHEL-7.5, and only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.

Steps to Reproduce:
Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer had a 2-socket Skylake server. I have been able to reproduce this on a 2-socket Cascade Lake server.

Additional info:
Thanks to great triaging help from Carlos O'Donell, the problem is understood.
It turns out glibc is selecting a sub-optimal memcpy routine for that processor.

On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then.

On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".

On RHEL-8.4, if the "Prefer_ERMS" attribute is given to glibc, then the faster "__memmove_erms()" is used.

For example, slow and fast cases:

perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
5.468937 GB/sec

GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
> perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
12.508272 GB/sec
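
The slow/fast comparison above can be scripted so that both runs use exactly the same command line. This is only a sketch: `compare_erms` is a hypothetical helper name, not part of perf or glibc, and it assumes `env` is available on the PATH.

```shell
# Hypothetical helper: run the given command twice, first with the default
# glibc memcpy selection, then with Prefer_ERMS requested via GLIBC_TUNABLES.
compare_erms() {
    echo "== default =="
    "$@"
    echo "== Prefer_ERMS =="
    # env sets the tunable only for this one invocation of the command.
    env GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS "$@"
}
```

For example, `compare_erms perf bench mem memcpy -f default --nr_loops 500 --size 3MB | grep -E '==|GB'` would print both rates under their labels.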

I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below:

gcc -O memcpy.c -o memcpy
./memcpy --help
USAGE: ./memcpy size-in-MB loop-iterations

./memcpy 3 500
Rate for 500 3MB memcpy iterations: 7.30 GB/sec

GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.29 GB/sec
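
To put the gap in proportion, the ratios can be computed from the figures quoted above. A small sketch, assuming `awk` is installed; `speedup` is a hypothetical helper name.

```shell
# speedup BASE FAST: print FAST/BASE to two decimal places.
speedup() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2fx\n", fast / base }'
}

speedup 5.468937 12.508272   # perf bench: default vs Prefer_ERMS
speedup 7.30 27.29           # memcpy reproducer: default vs Prefer_ERMS
```

With the numbers reported here, that is roughly 2.3x for perf bench and 3.7x for the reproducer.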

The customer's system was booted with mitigations=off and with transparent hugepages (THP) disabled. Neither is needed to reproduce this problem, but disabling THP does enable the simple memcpy reproducer to achieve much higher performance.
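
Since THP affects the reproducer's numbers, it may help to record the THP mode alongside each measurement. A minimal sketch, assuming the standard Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, which lists the modes with the active one in brackets; `thp_mode` is a hypothetical helper name.

```shell
# Read a line such as "always madvise [never]" on stdin and print the
# bracketed (active) mode.
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
# Only report when the file exists, so the script is harmless elsewhere.
if [ -r "$thp_file" ]; then
    echo "THP mode: $(thp_mode < "$thp_file")"
fi
```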

links to

RHBA-2024:127829 glibc update

Assignee: DJ Delorie

Reporter: DJ Delorie

Developer: DJ Delorie

QA Contact: Martin Coufal

Votes: 0

Watchers: 10

Created: 2024/01/26 8:55 PM

Updated: 2025/08/21 9:24 AM

Resolved: 2024/03/19 5:30 PM

Dev Target end: 2024/02/26

Release Date: 2024/03/19
