RHEL-33575

mtdgemm benchmark compiled with GCC v14 runs 2x as long as the binary compiled with GCC v11

    • Type: Bug
    • Resolution: Not a Bug
    • rhel-10.0.beta
    • gcc
    • rhel-sst-performance-engineering, rhel-sst-pt-gcc
    • ssg_platform_tools
    • Release Note Type: Unknown

      When comparing RHEL-10 performance against RHEL-9, the Phoronix mt-dgemm benchmark runs 2x slower on RHEL-10 than on RHEL-9. This is with all available CPUs in use (OpenMP binary). Results from Intel Icelake and AMD Rome:

      http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Phoronix/intel-icelake-gold-6330-2s.lab.eng.brq2.redhat.com/RHEL-9.4.0-20240406.57vsRHEL-10.0-20240410.11/2024-04-08T18:33:30.203797vs2024-04-18T10:41:47.500000/0dc53ddd-99ce-5324-b175-27438aadf9f7/index.html#mt-dgemm_section

      http://reports.perfqe.tpb.lab.eng.brq.redhat.com/testing/sched/reports/Phoronix/amd-epyc2-rome-7502-2s.lab.eng.brq2.redhat.com/RHEL-9.4.0-20240406.57vsRHEL-10.0-20240410.11/2024-04-08T18:33:30.203797vs2024-04-18T10:41:47.500000/43c67162-5f51-5d5d-9c15-07128024567d/index.html#mt-dgemm_section 

      Phoronix mt-dgemm benchmark:

      https://openbenchmarking.org/innhold/47c3d58042d95b8544390d6d5889f79529caba87 

      https://openbenchmarking.org/test/pts/mt-dgemm

       

      I was able to track the problem down to the GCC version. The issue can be reproduced on RHEL-9, even with only 1 thread:

      scl enable gcc-toolset-12 'bash'
      export OMP_NUM_THREADS=1
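
      To confirm which compiler is active inside the SCL shell before rerunning, a quick sanity check (standard commands, nothing benchmark-specific):

      gcc --version            # 11.x in the default shell, 12.x inside the gcc-toolset-12 shell
      echo "$OMP_NUM_THREADS"  # should print 1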

      The critical section of the code:

      132   #pragma omp parallel for private(sum)
      133   for(i = 0; i < N; i++) {
      134    for(j = 0; j < N; j++) {
      135     sum = 0;
      136 
      137     for(k = 0; k < N; k++) {
      138      sum += matrixA[i*N + k] * matrixB[k*N + j];
      139     }
      140 
      141     matrixC[i*N + j] = (alpha * sum) + (beta * matrixC[i*N + j]);
      142    }
      143   }
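
      For experiments outside the Phoronix harness, the kernel can be isolated into a small standalone file. The following is my own sketch, not the benchmark source; N, alpha, beta, and the matrix contents are arbitrary placeholders, and j and k are listed as private so the sketch stays race-free when run with more than one thread:

      /* Standalone sketch of the mt-dgemm kernel (lines 132-143 above).
       * NOT the benchmark source: N, alpha, beta and the matrix contents
       * are placeholders chosen only to make the file self-contained. */
      #include <stdio.h>
      #include <stdlib.h>

      #define N 2048

      int main(void)
      {
          double *matrixA = malloc(sizeof(double) * N * N);
          double *matrixB = malloc(sizeof(double) * N * N);
          double *matrixC = malloc(sizeof(double) * N * N);
          double alpha = 1.0, beta = 1.0, sum;
          int i, j, k;

          if (!matrixA || !matrixB || !matrixC)
              return 1;

          for (i = 0; i < N * N; i++) {
              matrixA[i] = 2.0;
              matrixB[i] = 0.5;
              matrixC[i] = 1.0;
          }

          /* Same triple loop as the benchmark source above. */
          #pragma omp parallel for private(sum, j, k)
          for (i = 0; i < N; i++) {
              for (j = 0; j < N; j++) {
                  sum = 0;
                  for (k = 0; k < N; k++)
                      sum += matrixA[i*N + k] * matrixB[k*N + j];
                  matrixC[i*N + j] = (alpha * sum) + (beta * matrixC[i*N + j]);
              }
          }

          printf("matrixC[0] = %f\n", matrixC[0]);
          free(matrixA); free(matrixB); free(matrixC);
          return 0;
      }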

      The benchmark was compiled with:

      gcc -g -O3 -march=native -fopenmp -fopt-info -o "${bin}" mt-dgemm/src/mt-dgemm.c
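
      To keep the vectorizer report of each build around for a side-by-side diff, the report can also be written to a file; -fopt-info-vec-optimized=<file> and gcc -dumpversion are standard GCC options, the log name is my own choice:

      gcc -g -O3 -march=native -fopenmp \
          -fopt-info-vec-optimized="mtdgemm-vec-$(gcc -dumpversion).log" \
          -o "${bin}" mt-dgemm/src/mt-dgemm.c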

      I see the following differences in the vectorization reports:

      GCC V11:

      mt-dgemm/src/mt-dgemm.c:132:11: optimized: basic block part vectorized using 32 byte vectors

      GCC V12:

      mt-dgemm/src/mt-dgemm.c:137:18: optimized: loop vectorized using 16 byte vectors
      mt-dgemm/src/mt-dgemm.c:137:18: optimized: loop turned into non-loop; it never loops
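
      A quick way to confirm that this loop vectorization is what costs the time would be to rebuild with the loop vectorizer disabled and compare run times (-fno-tree-loop-vectorize is a standard GCC flag; this is a diagnostic experiment, not a suggested fix):

      scl enable gcc-toolset-12 'bash'
      gcc -g -O3 -march=native -fopenmp -fopt-info \
          -fno-tree-loop-vectorize \
          -o "${bin}-novec" mt-dgemm/src/mt-dgemm.c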

      perf annotate shows the following differences:

      GCC V11 - a scalar FMA instruction is used (vfmadd231sd operates only on the low double-precision floating-point value of the xmm register):

        92.63% :   4017c0: vfmadd231sd (%rdx),%xmm5,%xmm0 
         0.04% :   4017c5: add    %rsi,%rdx 
         0.00% :   4017c8: cmp    %rax,%rcx 
         7.27% :   4017cb: jne    4017b8 <main._omp_fn.1+0xe8>

      GCC V12 - vector instructions are used, operating on a pair of double-precision values. Most of the time is spent in the vmovhpd (Move High Packed Double-Precision Floating-Point Value) instruction, which moves one double-precision value from a memory location or XMM register to the high half of another XMM register. This follows from vectorizing the k loop: matrixB is accessed with stride N, so each 16-byte vector of B elements has to be assembled from two loads that are N*8 bytes apart (vmovsd for the low half, vmovhpd for the high half):

        30.72% :   4017b8: vmovsd (%rax),%xmm0 
         0.01% :   4017bc: add    $0x10,%rdx 
        40.83% :   4017c0: vmovhpd (%rax,%r9,1),%xmm0,%xmm0 
         3.57% :   4017c6: vmulpd -0x10(%rdx),%xmm0,%xmm0 
         0.00% :   4017cb: add    %r10,%rax 
         8.37% :   4017ce: vaddsd %xmm0,%xmm1,%xmm1 
         0.01% :   4017d2: vunpckhpd %xmm0,%xmm0,%xmm0 
        13.16% :   4017d6: vaddsd %xmm1,%xmm0,%xmm1 
         0.00% :   4017da: cmp    %r8,%rdx 
         3.27% :   4017dd: jne    4017b8 <main._omp_fn.1+0xf8>
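
      For reference, the annotated listings above can be reproduced along these lines (the symbol name is taken from the jump targets above; the binary name and benchmark arguments are placeholders, see run.sh for the real invocation):

      export OMP_NUM_THREADS=1
      perf record -e cycles -- "./${bin}"   # benchmark arguments omitted
      perf annotate --stdio main._omp_fn.1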

      perf stat -d shows a drop in L1-dcache-loads together with an increase in the L1-dcache-load-miss and LLC-load-miss rates:

      GCC v11:

       

      98,178,107,989 L1-dcache-loads       #  506.997 M/sec                     (90.91%)
      49,241,071,195 L1-dcache-load-misses #   50.15% of all L1-dcache accesses (90.91%)
      46,841,599,720 LLC-loads             #  241.892 M/sec                     (90.91%)
       6,874,313,062 LLC-load-misses       #   14.68% of all LL-cache accesses  (90.91%)

      193.679625668 seconds time elapsed

      GCC v12:

       

      73,771,414,994 L1-dcache-loads       #  340.012 M/sec                     (90.91%)
      49,468,808,201 L1-dcache-load-misses #   67.06% of all L1-dcache accesses (90.91%)
      46,849,369,007 LLC-loads             #  215.929 M/sec                     (90.91%)
      10,372,074,893 LLC-load-misses       #   22.14% of all LL-cache accesses  (90.91%)

      217.005472851 seconds time elapsed
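
      The counter blocks above come from an invocation along these lines (a sketch; the binary name and benchmark arguments are placeholders):

      export OMP_NUM_THREADS=1
      perf stat -d -- "./${bin}"   # benchmark arguments omitted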
      

      I'm going to upload the results from the Intel Icelake server (https://beaker.engineering.redhat.com/view/formir1.slevarna.tpb.lab.eng.brq.redhat.com#details) running RHEL-9.4.0-20240413.1 with

      gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)

      and

      gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)

      To reproduce the issue, run the following:

       

      ./run.sh                          # first pass: system GCC (11.x)
      scl enable gcc-toolset-12 'bash'
      ./run.sh                          # second pass: GCC 12 from gcc-toolset-12
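
      run.sh itself is attached together with the logs; as a rough reconstruction of what such a wrapper has to do (my sketch, not the actual script):

      #!/bin/bash
      # Sketch of a run.sh-style wrapper: build with whatever gcc is active
      # and tag every artifact with the compiler version.
      ver=$(gcc -dumpversion)
      bin="mtdgemm-gcc-${ver}"
      gcc -g -O3 -march=native -fopenmp -fopt-info \
          -o "${bin}" mt-dgemm/src/mt-dgemm.c 2> "${bin}_compilation.log"
      perf stat -d -o "${bin}_1.log" -- "./${bin}"        # benchmark arguments omitted
      perf record -o "${bin}.perf.data" -- "./${bin}"
      perf annotate -i "${bin}.perf.data" --stdio > "${bin}.perf.annotate"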
      

      All log files have the GCC version in the name. Files to check:

      perf stat output: mtdgemm-gcc-11.4.1-3_1.log

      gcc optimization logs: mtdgemm-gcc-11.4.1-3_compilation.log

      perf annotate: mtdgemm-gcc-11.4.1-3.perf.annotate

       

      Could you please investigate whether GCC 12 can produce code that runs as fast as the binary compiled with GCC 11?

       

      Thanks a lot

      Jirka

       

       

              Assignee: Siddhesh Poyarekar
              Reporter: Jiri Hladky
              Marek Polacek
              Vaclav Kadlcik