RHEL-17432

"perf record" is causing an infinite loop of Call Traces on AMD Bergamo with AMD microcode 0xaa0020f

    • Bug
    • Resolution: Done
    • rhel-8.10
    • linux-firmware
    • rhel-sst-kernel-maintainers
    • ssg_core_kernel
    • x86_64

      What were you trying to do that didn't work?

      Running "perf record <binary>" on the AMD Bergamo server amd-epyc4-zen4c-bergamo-9754-1s.lab.eng.brq2.redhat.com leads to an infinite loop of Call Traces:

       

      WARNING: CPU: 230 PID: 835838 at arch/x86/events/amd/core.c:952 amd_pmu_v2_handle_irq+0x2f8/0x304
      Modules linked in: binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul wmi_bmof sunrpc acpi_ipmi ghash_clmulni_intel cdc_ether joydev usbnet rapl sp5100_tco pcspkr mii ipmi_si ipmi_devintf k10temp i2c_piix4 ccp ipmi_msghandler wmi acpi_cpufreq vfat fat fuse xfs libcrc32c sr_mod cdrom sg ast drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm crc32c_intel ahci nvme igb libahci nvme_core libata t10_pi dca i2c_algo_bit uas usb_storage [last unloaded: stap_5980842448a9d4f295e4fb42b7f97571_5132]
      CPU: 230 PID: 835838 Comm: abrt-handle-eve Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-526.el8.x86_64 #1
      Hardware name: ASUSTeK COMPUTER INC. RS500A-E12-RS12U VR23005466/K14PA-U24 Series, BIOS 1101 07/18/2023
      RIP: 0010:amd_pmu_v2_handle_irq+0x2f8/0x304
      Code: 0f ba f6 3a e9 c4 fd ff ff 31 f6 48 89 df e8 bf b9 ff ff 4c 8b 44 24 30 e9 af fe ff ff c7 44 24 38 00 00 00 00 e9 b9 fe ff ff <0f> 0b e9 bb fe ff ff e8 dc bb 0e 00 48 c7 c7 38 47 0d 83 c6 05 4a
      RSP: 0000:ff5afa8e4c61fcc0 EFLAGS: 00010002
      RAX: 0000000000000005 RBX: ff4876a797182440 RCX: ff5afa8e4c61ff58
      RDX: ff5afa8e4c61ff58 RSI: ff5afa8e4c61fd00 RDI: ff4876a797182440
      RBP: ff5afa8e4c61fe90 R08: fffffffffffffffe R09: 0000000000032940
      R10: 000004de0fd1c1a0 R11: 0000000000000000 R12: ff4876d5cc997ca0
      R13: 0000000000000006 R14: 0000000000000002 R15: ff4876d5cc997ea0
      FS:  00007f0d111d8bc0(0000) GS:ff4876d5cc980000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055b11dcfa4f8 CR3: 0000009150900001 CR4: 0000000000771ee0
      PKRU: 55555554
      Call Trace:
       ? __warn+0x94/0xe0
       ? amd_pmu_v2_handle_irq+0x2f8/0x304
       ? amd_pmu_v2_handle_irq+0x2f8/0x304
       ? report_bug+0xb1/0xe0
       ? page_counter_try_charge+0x5d/0xe0
       ? do_error_trap+0x9e/0xd0
       ? do_invalid_op+0x36/0x40
       ? amd_pmu_v2_handle_irq+0x2f8/0x304
       ? invalid_op+0x14/0x20
       ? amd_pmu_v2_handle_irq+0x2f8/0x304
       ? atime_needs_update+0x77/0xe0
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? down_read+0xe/0xa0
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? xfs_iunlock+0xdc/0x110 [xfs]
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? xfs_file_buffered_read+0x53/0xb0 [xfs]
       ? srso_alias_return_thunk+0x5/0xfcdfd
       ? xfs_file_read_iter+0x6e/0xe0 [xfs]
       perf_event_nmi_handler+0x2d/0x50
       nmi_handle+0x63/0x110
       ? vfs_read+0x121/0x150
       default_do_nmi+0x49/0x110
       do_nmi+0x1af/0x220
       nmi+0xab/0xf4
      RIP: 0033:0x7f0d1059ba45
      Code: 00 00 00 00 00 e8 bb a7 ff ff 47 8d 6c 2d 01 48 8b 00 f6 40 15 20 0f 84 5f ff ff ff 0f be f3 e9 78 ff ff ff 90 e8 9b a7 ff ff <48> 63 d3 48 8b 00 f6 44 50 01 20 75 e5 83 fb 1f 0f 8e 3b ff ff ff
      RSP: 002b:00007ffdadd00950 EFLAGS: 00000207
      RAX: 00007f0d111d8b50 RBX: 000000000000006d RCX: 000055b11dcf963e
      RDX: 000000000000006d RSI: 0000000000000020 RDI: 000055b11dcf33a0
      RBP: 000055b11dcf33a0 R08: 00007f0d10028bc0 R09: 000055b11da6120e
      R10: 0000000000000000 R11: 0000000000000246 R12: 000055b11dcf8a00
      R13: 0000000000000001 R14: 000055b11dce90e0 R15: 000055b11db12fe0

       

      It is known to be caused by buggy AMD microcode. See https://issues.redhat.com/browse/RHEL-2150 for details.

       

      In RHEL-8.8.0, a partial workaround was implemented in the kernel; see https://issues.redhat.com/browse/RHEL-12340

       

      Unfortunately, the RHEL-12340 workaround fixes only the "perf top" command; "perf record <binary>" still eventually triggers the problem.

      Please provide the package NVR for which bug is seen:

      RHEL-8.10.0-20231121.1 with linux-firmware-20230824-119.git0e048b06.el8_9.noarch and kernel-4.18.0-526.el8

      How reproducible:

      On an AMD Bergamo server with this CPU:

      cpu family      : 25
      model           : 160
      model name      : AMD EPYC 9754 128-Core Processor

      run a "perf record <binary>" command. To reproduce:

       

       

      git clone https://gitlab.cee.redhat.com/kernel-performance/sched/scheduler-benchmarks.git
      cd scheduler-benchmarks/Stress_ng-test
      ./runtest.sh --iterations 1 --list_of_threads 1 --no_rsync
      

       

      Then watch dmesg in another terminal. You will start seeing warnings:

      [27881.332024] WARNING: CPU: 41 PID: 521295 at arch/x86/events/amd/core.c:952 amd_pmu_v2_handle_irq+0x2f8/0x304

                                       

      followed by Call Traces:

      [27881.332067] Call Trace:
      [27881.332067]  <NMI>
      [27881.332068]  ? __warn+0x94/0xe0
      [27881.332068]  ? amd_pmu_v2_handle_irq+0x2f8/0x304
      [27881.332069]  ? amd_pmu_v2_handle_irq+0x2f8/0x304
      [27881.332069]  ? report_bug+0xb1/0xe0
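
To see how quickly the warnings accumulate, the kernel log can be filtered for the WARNING line quoted above. This is a minimal sketch, assuming a POSIX shell with grep; the function name count_pmu_warnings is hypothetical and not part of perf or any Red Hat tool.

```shell
#!/bin/sh
# Count kernel-log lines on stdin that carry the PMU warning from this report.
# The pattern matches the "WARNING: ... amd_pmu_v2_handle_irq" line, not the
# repeated symbol entries inside each Call Trace.
count_pmu_warnings() {
    # grep -c exits non-zero when nothing matches, but it still prints 0.
    grep -c 'WARNING: .*amd_pmu_v2_handle_irq' || true
}
```

One possible use is to run `dmesg | count_pmu_warnings` twice a few seconds apart while the reproducer is running; a steadily growing count would match the behaviour described here.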

      See also: http://faf.lab.eng.brq2.redhat.com:8080/faf/reports/76406/

      The problem is known to be caused by a bug in AMD microcode 0xaa0020f and is fixed by microcode 0xaa00213.
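
Whether a machine still runs the affected microcode can be read from the microcode field of /proc/cpuinfo. This is a minimal sketch, assuming a POSIX shell with awk. The function names are hypothetical. Treating every revision numerically below 0xaa00213 as lacking the fix is an assumption, and it only makes sense for the Bergamo CPU described in this report.

```shell
#!/bin/sh
FIXED_REV=0xaa00213   # revision named in this report as containing the fix

# Print the microcode revision of the first CPU in a cpuinfo-style file.
# The optional path argument (default /proc/cpuinfo) exists only to ease testing.
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeed if revision $1 is at least FIXED_REV.
# Assumption: revisions for this CPU are ordered numerically.
microcode_has_fix() {
    [ "$(( $1 ))" -ge "$(( $FIXED_REV ))" ]
}
```

For example, `microcode_has_fix "$(microcode_rev)" || echo "microcode lacks the fix"` could be run before starting "perf record" on a Bergamo host.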

       

                                                                  

              rhn-support-dvlasenk Denys Vlasenko
              jhladky1@redhat.com Jiri Hladky
              Denys Vlasenko Denys Vlasenko
              Laura Trivelloni Laura Trivelloni