Bug
Resolution: Done
rhel-8.10
sst_kernel_maintainers
ssg_core_kernel
x86_64
What were you trying to do that didn't work?
Running "perf record <binary>" on the AMD Bergamo server amd-epyc4-zen4c-bergamo-9754-1s.lab.eng.brq2.redhat.com leads to an endless stream of warnings with call traces:
WARNING: CPU: 230 PID: 835838 at arch/x86/events/amd/core.c:952 amd_pmu_v2_handle_irq+0x2f8/0x304
Modules linked in: binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul wmi_bmof sunrpc acpi_ipmi ghash_clmulni_intel cdc_ether joydev usbnet rapl sp5100_tco pcspkr mii ipmi_si ipmi_devintf k10temp i2c_piix4 ccp ipmi_msghandler wmi acpi_cpufreq vfat fat fuse xfs libcrc32c sr_mod cdrom sg ast drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm crc32c_intel ahci nvme igb libahci nvme_core libata t10_pi dca i2c_algo_bit uas usb_storage [last unloaded: stap_5980842448a9d4f295e4fb42b7f97571_5132]
CPU: 230 PID: 835838 Comm: abrt-handle-eve Kdump: loaded Tainted: G W OE --------- - - 4.18.0-526.el8.x86_64 #1
Hardware name: ASUSTeK COMPUTER INC. RS500A-E12-RS12U VR23005466/K14PA-U24 Series, BIOS 1101 07/18/2023
RIP: 0010:amd_pmu_v2_handle_irq+0x2f8/0x304
Code: 0f ba f6 3a e9 c4 fd ff ff 31 f6 48 89 df e8 bf b9 ff ff 4c 8b 44 24 30 e9 af fe ff ff c7 44 24 38 00 00 00 00 e9 b9 fe ff ff <0f> 0b e9 bb fe ff ff e8 dc bb 0e 00 48 c7 c7 38 47 0d 83 c6 05 4a
RSP: 0000:ff5afa8e4c61fcc0 EFLAGS: 00010002
RAX: 0000000000000005 RBX: ff4876a797182440 RCX: ff5afa8e4c61ff58
RDX: ff5afa8e4c61ff58 RSI: ff5afa8e4c61fd00 RDI: ff4876a797182440
RBP: ff5afa8e4c61fe90 R08: fffffffffffffffe R09: 0000000000032940
R10: 000004de0fd1c1a0 R11: 0000000000000000 R12: ff4876d5cc997ca0
R13: 0000000000000006 R14: 0000000000000002 R15: ff4876d5cc997ea0
FS: 00007f0d111d8bc0(0000) GS:ff4876d5cc980000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b11dcfa4f8 CR3: 0000009150900001 CR4: 0000000000771ee0
PKRU: 55555554
Call Trace:
 ? __warn+0x94/0xe0
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? report_bug+0xb1/0xe0
 ? page_counter_try_charge+0x5d/0xe0
 ? do_error_trap+0x9e/0xd0
 ? do_invalid_op+0x36/0x40
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? invalid_op+0x14/0x20
 ? amd_pmu_v2_handle_irq+0x2f8/0x304
 ? atime_needs_update+0x77/0xe0
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? down_read+0xe/0xa0
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? xfs_iunlock+0xdc/0x110 [xfs]
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? xfs_file_buffered_read+0x53/0xb0 [xfs]
 ? srso_alias_return_thunk+0x5/0xfcdfd
 ? xfs_file_read_iter+0x6e/0xe0 [xfs]
 perf_event_nmi_handler+0x2d/0x50
 nmi_handle+0x63/0x110
 ? vfs_read+0x121/0x150
 default_do_nmi+0x49/0x110
 do_nmi+0x1af/0x220
 nmi+0xab/0xf4
RIP: 0033:0x7f0d1059ba45
Code: 00 00 00 00 00 e8 bb a7 ff ff 47 8d 6c 2d 01 48 8b 00 f6 40 15 20 0f 84 5f ff ff ff 0f be f3 e9 78 ff ff ff 90 e8 9b a7 ff ff <48> 63 d3 48 8b 00 f6 44 50 01 20 75 e5 83 fb 1f 0f 8e 3b ff ff ff
RSP: 002b:00007ffdadd00950 EFLAGS: 00000207
RAX: 00007f0d111d8b50 RBX: 000000000000006d RCX: 000055b11dcf963e
RDX: 000000000000006d RSI: 0000000000000020 RDI: 000055b11dcf33a0
RBP: 000055b11dcf33a0 R08: 00007f0d10028bc0 R09: 000055b11da6120e
R10: 0000000000000000 R11: 0000000000000246 R12: 000055b11dcf8a00
R13: 0000000000000001 R14: 000055b11dce90e0 R15: 000055b11db12fe0
This is known to be caused by buggy AMD microcode. See https://issues.redhat.com/browse/RHEL-2150
for details.
In RHEL-8.8.0, a partial workaround was implemented in the kernel; see
https://issues.redhat.com/browse/RHEL-12340
Unfortunately, the RHEL-12340 workaround fixes only the "perf top" command. "perf record <binary>" still eventually triggers the problem.
Please provide the package NVR for which the bug is seen:
RHEL-8.10.0-20231121.1 with linux-firmware-20230824-119.git0e048b06.el8_9.noarch and kernel-4.18.0-526.el8
How reproducible:
On an AMD Bergamo server with this CPU:
cpu family : 25
model : 160
model name : AMD EPYC 9754 128-Core Processor
run a "perf record <binary>" command.
git clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
cd scheduler-benchmarks/Stress_ng-test
./runtest.sh --iterations 1 --list_of_threads 1 --no_rsync
and watch dmesg in another terminal. You will start getting warnings:
[27881.332024] WARNING: CPU: 41 PID: 521295 at arch/x86/events/amd/core.c:952 amd_pmu_v2_handle_irq+0x2f8/0x304
followed by Call Traces:
[27881.332067] Call Trace:
[27881.332067] <NMI>
[27881.332068] ? __warn+0x94/0xe0
[27881.332068] ? amd_pmu_v2_handle_irq+0x2f8/0x304
[27881.332069] ? amd_pmu_v2_handle_irq+0x2f8/0x304
[27881.332069] ? report_bug+0xb1/0xe0
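To make the warnings easier to spot while the reproducer runs, something like the following could be used. It is only a sketch: count_pmu_warnings is a hypothetical helper name, and the pattern is simply taken from the WARNING line quoted above.

```shell
# Hypothetical helper: count amd_pmu_v2_handle_irq WARNING lines
# in kernel log text supplied on stdin.
count_pmu_warnings() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are no matches, so "|| true" keeps the helper quiet.
    grep -c 'WARNING: .*amd_pmu_v2_handle_irq' || true
}

# Example use (reading the kernel log may require root):
#   dmesg | count_pmu_warnings
```

Running it before and after the reproducer shows whether the count keeps growing.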
See also: http://faf.lab.eng.brq2.redhat.com:8080/faf/reports/76406/
The problem is known to be caused by a bug in AMD microcode 0xaa0020f and is fixed by microcode 0xaa00213.
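A quick way to check which microcode revision a machine is running is sketched below. The helper name microcode_is_fixed is hypothetical, and treating every revision at or above 0xaa00213 as fixed is an assumption; this report only names 0xaa0020f as buggy and 0xaa00213 as fixed.

```shell
# Hypothetical helper: succeed if the given microcode revision is
# at least 0xaa00213 (the revision this report names as fixed).
# ASSUMPTION: newer revisions also carry the fix.
microcode_is_fixed() {
    [ "$(( $1 ))" -ge "$(( 0xaa00213 ))" ]
}

# Read the loaded revision from /proc/cpuinfo (x86 exposes a
# "microcode" field); stay silent if the field is not available.
rev=$(awk -F': ' '/^microcode/ {print $2; exit}' /proc/cpuinfo 2>/dev/null || true)
if [ -n "$rev" ]; then
    if microcode_is_fixed "$rev"; then
        echo "microcode $rev: at or above 0xaa00213"
    else
        echo "microcode $rev: older than 0xaa00213"
    fi
fi
```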