Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: rhel-8.10
Component/s: linux-firmware
Labels:
- known_issue_810

Regression: None
Severity: None
AssignedTeam: rhel-kernel-maint
Sub-System Group: ssg_core_kernel
Story Points: None
Blocked: False
Ready: False
Blocked Reason: None
Product Documentation Required: None
Sprint: None
Preliminary Testing: None
Test Coverage: Manual
Experience:
Architecture: x86_64
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Planning: None

What were you trying to do that didn't work?

Running "perf record <binary>" on the AMD Bergamo server amd-epyc4-zen4c-bergamo-9754-1s.lab.eng.brq2.redhat.com triggers an endless stream of kernel warnings with Call Traces:

WARNING: CPU: 230 PID: 835838 at arch/x86/events/amd/core.c:952 amd_pmu_v2_handle_irq+0x2f8/0x304
Modules linked in: binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul wmi_bmof sunrpc acpi_ipmi ghash_clmulni_intel cdc_ether joydev usbnet rapl sp5100_tco pcspkr mii ipmi_si ipmi_devintf k10temp i2c_piix4 ccp ipmi_msghandler wmi acpi_cpufreq vfat fat fuse xfs libcrc32c sr_mod cdrom sg ast drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm crc32c_intel ahci nvme igb libahci nvme_core libata t10_pi dca i2c_algo_bit uas usb_storage [last unloaded: stap_5980842448a9d4f295e4fb42b7f97571_5132]
CPU: 230 PID: 835838 Comm: abrt-handle-eve Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-526.el8.x86_64 #1
Hardware name: ASUSTeK COMPUTER INC. RS500A-E12-RS12U VR23005466/K14PA-U24 Series, BIOS 1101 07/18/2023
RIP: 0010:amd_pmu_v2_handle_irq+0x2f8/0x304
Code: 0f ba f6 3a e9 c4 fd ff ff 31 f6 48 89 df e8 bf b9 ff ff 4c 8b 44 24 30 e9 af fe ff ff c7 44 24 38 00 00 00 00 e9 b9 fe ff ff <0f> 0b e9 bb fe ff ff e8 dc bb 0e 00 48 c7 c7 38 47 0d 83 c6 05 4a
RSP: 0000:ff5afa8e4c61fcc0 EFLAGS: 00010002
RAX: 0000000000000005 RBX: ff4876a797182440 RCX: ff5afa8e4c61ff58
RDX: ff5afa8e4c61ff58 RSI: ff5afa8e4c61fd00 RDI: ff4876a797182440
RBP: ff5afa8e4c61fe90 R08: fffffffffffffffe R09: 0000000000032940
R10: 000004de0fd1c1a0 R11: 0000000000000000 R12: ff4876d5cc997ca0
R13: 0000000000000006 R14: 0000000000000002 R15: ff4876d5cc997ea0
FS:  00007f0d111d8bc0(0000) GS:ff4876d5cc980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b11dcfa4f8 CR3: 0000009150900001 CR4: 0000000000771ee0
PKRU: 55555554
Call Trace:
 ? __warn+0x94/0xe0
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? report_bug+0xb1/0xe0
 ? page_counter_try_charge+0x5d/0xe0
 ? do_error_trap+0x9e/0xd0
 ? do_invalid_op+0x36/0x40
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? invalid_op+0x14/0x20
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? atime_needs_update+0x77/0xe0
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? down_read+0xe/0xa0
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? xfs_iunlock+0xdc/0x110 [xfs]
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? xfs_file_buffered_read+0x53/0xb0 [xfs]
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? xfs_file_read_iter+0x6e/0xe0 [xfs]
 perf_event_nmi_handler+0x2d/0x50
 nmi_handle+0x63/0x110
 ? vfs_read+0x121/0x150
 default_do_nmi+0x49/0x110
 do_nmi+0x1af/0x220
 nmi+0xab/0xf4
RIP: 0033:0x7f0d1059ba45
Code: 00 00 00 00 00 e8 bb a7 ff ff 47 8d 6c 2d 01 48 8b 00 f6 40 15 20 0f 84 5f ff ff ff 0f be f3 e9 78 ff ff ff 90 e8 9b a7 ff ff <48> 63 d3 48 8b 00 f6 44 50 01 20 75 e5 83 fb 1f 0f 8e 3b ff ff ff
RSP: 002b:00007ffdadd00950 EFLAGS: 00000207
RAX: 00007f0d111d8b50 RBX: 000000000000006d RCX: 000055b11dcf963e
RDX: 000000000000006d RSI: 0000000000000020 RDI: 000055b11dcf33a0
RBP: 000055b11dcf33a0 R08: 00007f0d10028bc0 R09: 000055b11da6120e
R10: 0000000000000000 R11: 0000000000000246 R12: 000055b11dcf8a00
R13: 0000000000000001 R14: 000055b11dce90e0 R15: 000055b11db12fe0

This is known to be caused by buggy AMD microcode. See https://issues.redhat.com/browse/RHEL-2150 for details.

A partial kernel workaround was implemented in RHEL-8.8.0; see https://issues.redhat.com/browse/RHEL-12340

Unfortunately, the RHEL-12340 workaround fixes only the "perf top" command. Running "perf record <binary>" still eventually triggers the problem.

Please provide the package NVR for which bug is seen:

RHEL-8.10.0-20231121.1 with linux-firmware-20230824-119.git0e048b06.el8_9.noarch and kernel-4.18.0-526.el8

How reproducible:

On an AMD Bergamo server with this CPU:

cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

run a "perf record <binary>" command.

git clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
cd scheduler-benchmarks/Stress_ng-test
./runtest.sh --iterations 1 --list_of_threads 1 --no_rsync

and watch dmesg in another terminal. You will start getting warnings:

[27881.332024] WARNING: CPU: 41 PID: 521295 at arch/x86/events/amd/core.c:952 amd_pmu_v2_handle_irq+0x2f8/0x304

followed by Call Traces:

[27881.332067] Call Trace:
[27881.332067]  <NMI>
[27881.332068]  ? __warn+0x94/0xe0
[27881.332068]  ? amd_pmu_v2_handle_irq+0x2f8/0x304
[27881.332069]  ? amd_pmu_v2_handle_irq+0x2f8/0x304
[27881.332069]  ? report_bug+0xb1/0xe0
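
To quantify how often the warning fires during a run, the kernel log can be filtered for the function name that appears in the trace. The following is a minimal sketch, assuming a POSIX shell and kernel log text on standard input; the helper name count_pmu_warnings is hypothetical and not part of the original report:

```shell
# Hypothetical helper: count WARNING lines raised from amd_pmu_v2_handle_irq
# in kernel log text supplied on standard input.
count_pmu_warnings() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the helper's status at success.
    grep -c 'WARNING: CPU: .*amd_pmu_v2_handle_irq' || true
}
```

For a one-off count, pipe the log into the helper with "dmesg | count_pmu_warnings". To watch the warnings arrive live in the second terminal, "dmesg --follow | grep amd_pmu_v2_handle_irq" should work with the util-linux dmesg.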

See also: http://faf.lab.eng.brq2.redhat.com:8080/faf/reports/76406/

The problem is known to be caused by a bug in AMD microcode 0xaa0020f and is fixed by microcode 0xaa00213.
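
Whether a given machine already runs the fixed microcode can be checked from the "microcode" field of /proc/cpuinfo. The sketch below is illustrative only: the helper name microcode_has_fix is hypothetical, and it assumes that any revision numerically greater than or equal to 0xaa00213 also contains the fix, which the report does not state. The comparison is meaningful only on the CPU model listed above, because microcode revision numbers are specific to a CPU model.

```shell
# Hypothetical helper: succeed if the microcode revision found in a
# cpuinfo-style file is at least 0xaa00213, the fixed revision named in
# this report. Assumption: later revisions keep the fix.
# $1: path to the file to inspect (defaults to /proc/cpuinfo).
microcode_has_fix() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the value of the first "microcode : 0x..." line.
    rev=$(awk -F': *' '/^microcode/ { print $2; exit }' "$cpuinfo")
    # Return 2 when the field is missing, so callers can tell
    # "unknown" apart from "older than the fixed revision".
    [ -n "$rev" ] || return 2
    # Shell arithmetic accepts 0x-prefixed hexadecimal constants.
    [ "$(($rev))" -ge "$((0xaa00213))" ]
}
```

A possible use is "microcode_has_fix && echo fixed || echo 'affected or unknown'", run before starting the reproducer.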

Assignee: Denys Vlasenko

Reporter: Jiri Hladky

Developer: Denys Vlasenko

QA Contact: Laura Trivelloni

Votes: 0

Watchers: 3

Created: 2023/11/28 2:46 AM

Updated: 2025/01/31 12:00 PM

Resolved: 2023/11/28 1:42 PM
