[muralimk@sh5x-2p-126-012c ~]$ sudo rpm -ivh rasdaemon-0.6.7-13.el9.x86_64.rpm
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:rasdaemon-0.6.7-13.el9           ################################# [100%]
[muralimk@sh5x-2p-126-012c ~]$ whereis rasdaemon
rasdaemon: /usr/sbin/rasdaemon /usr/share/man/man1/rasdaemon.1.gz
[muralimk@sh5x-2p-126-012c ~]$ rasdaemon --version
rasdaemon 0.6.7
[muralimk@sh5x-2p-126-012c ~]$ rasdaemon --help
Usage: rasdaemon [OPTION...]
RAS daemon to log the RAS events.

  -d, --disable              disable RAS events and exit
  -e, --enable               enable RAS events and exit
  -f, --foreground           run foreground, not daemonize
  -p, --post-processing      Post-processing MCE's with raw register values
  -r, --record               record events via sqlite3

 Post-Processing Options:

      --bank                 Bank Number
      --family               CPU Family
      --ipid                 IPID Register (for SMCA systems only)
      --model                CPU Model
      --smca                 AMD SMCA Error Decoding
      --status               Status Register
      --synd                 Syndrome Register

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Report bugs to Mauro Carvalho Chehab.

A script was used to inject errors at specific extended error codes.

Example 1:
============
Injected Error at BANK=60(SMU) and Error code 62

sudo dmesg:
------------
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (19:90:1) MC60_STATUS[-|CE|-|-|PCC|-|SyndV|CECC|-|-|-]: 0x82204000003e0000
[Hardware Error]: PPIN: 0x26609c724019c00f
[Hardware Error]: IPID: 0x0001000103b30400, Syndrome: 0x00000001c001c0db
[Hardware Error]: System Management Unit Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN

Execute rasdaemon:
===============
[muralimk@sh5x-2p-126-012c ~]$ sudo rasdaemon -f
overriding event (1281) ras:mc_event with new print handler
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
overriding event (1278) ras:aer_event with new print handler
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
overriding event (113) mce:mce_record with new print handler
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
overriding event (1282) ras:extlog_mem_event with new print handler
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: Enabled event ras:extlog_mem_event
overriding event (1372) net:net_dev_xmit_timeout with new print handler
rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon: Enabled event net:net_dev_xmit_timeout
overriding event (1384) devlink:devlink_health_report with new print handler
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: Enabled event devlink:devlink_health_report
overriding event (1074) block:block_rq_error with new print handler
rasdaemon: block:block_rq_error event enabled
rasdaemon: Enabled event block:block_rq_error
rasdaemon: Listening to events for cpus 0 to 95
rasdaemon: mce_record store: 0x55ba9982ee78
rasdaemon: register inserted at db
<...>-887778 [000] 0.031708: mce_record: 2024-06-30 07:27:09 +0000 System Management Unit (bank=60), status= 82204000003e0000, Corrected error, no action required., mci=Processor_context_corrupt CECC, mca=A poison error from a GFX Sub-IP. Ext Err Code: 62, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, synd= 1c001c0db, ipid= 1000103b30400, mcgstatus=0, mcgcap= 140, apicid= 0

Offline rasdaemon parse logs:
=============================
[muralimk@sh5x-2p-126-012c ~]$ rasdaemon -p --status 0x82204000003e0000 --ipid 0x1000103b30400 --smca --family 0x19 --model 0x90 --bank 60
2024-06-30 07:28:13 +0000, System Management Unit (bank=60), mca: A poison error from a GFX Sub-IP. Ext Err Code: 62, mci: Processor_context_corrupt CECC, Error Msg: Corrected error, no action required.

Example 2:
============
Injected Error at BANK=60(SMU) and Error code 59

dmesg:
-----
kernel:[Hardware Error]: Corrected error, no action required.
kernel:[Hardware Error]: CPU:0 (19:90:1) MC60_STATUS[-|CE|-|-|PCC|-|SyndV|CECC|-|-|-]: 0x82204000003b0000
kernel:[Hardware Error]: PPIN: 0x26609c724019c00f
kernel:[Hardware Error]: IPID: 0x0001000103b30400, Syndrome: 0x00000001c001c0db
kernel:[Hardware Error]: System Management Unit Ext. Error Code: 59
kernel:[Hardware Error]: cache level: RESV, tx: INSN

rasdaemon:
----------------
rasdaemon: mce_record store: 0x55ceedc027d8
rasdaemon: register inserted at db
<...>-887778 [000] 0.031670: mce_record: 2024-06-30 07:20:49 +0000 System Management Unit (bank=60), status= 82204000003b0000, Corrected error, no action required., mci=Processor_context_corrupt CECC, mca=A fatal error from a GFX Sub-IP. Ext Err Code: 59, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, synd= 1c001c0db, ipid= 1000103b30400, mcgstatus=0, mcgcap= 140, apicid= 0

Offline rasdaemon parse logs:
=============================
$ rasdaemon -p --status 0x82204000003b0000 --ipid 0x1000103b30400 --smca --family 0x19 --model 0x90 --bank 60
2024-06-30 07:26:09 +0000, System Management Unit (bank=60), mca: A fatal error from a GFX Sub-IP. Ext Err Code: 59, mci: Processor_context_corrupt CECC, Error Msg: Corrected error, no action required.
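
The status values in these examples can be cross-checked by hand. The sketch below is not part of rasdaemon; it is a minimal illustration that assumes the extended error code sits in bits 21:16 of MCA_STATUS and that the flag bits are laid out as in the table in the code. That layout is consistent with the MC60_STATUS flags dmesg printed above (PCC, SyndV, CECC), but it is an assumption here and should be checked against the processor documentation. The names STATUS_FLAGS and decode_status are hypothetical.

```python
# Assumed AMD MCA_STATUS flag positions (illustrative; verify against the PPR).
STATUS_FLAGS = {
    "Val": 63, "Overflow": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}


def decode_status(status: int) -> dict:
    """Split a raw 64-bit status value into flags and error-code fields."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Assumed field positions: extended code in bits 21:16, base code in 15:0.
        "ext_err_code": (status >> 16) & 0x3F,
        "err_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # The two status values logged in Example 1 and Example 2.
    for raw in (0x82204000003E0000, 0x82204000003B0000):
        print(hex(raw), decode_status(raw))
```

For the two values above, the extended error code field comes out as 62 and 59, matching the "Ext Err Code" reported by both dmesg and rasdaemon.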
Inject Error at code: 58
=======================
register inserted at db
<...>-887778 [000] 0.031667: mce_record: 2024-06-30 07:20:12 +0000 System Management Unit (bank=60), status= 82204000003a0000, Corrected error, no action required., mci=Processor_context_corrupt CECC, mca=A correctable error from a GFX Sub-IP. Ext Err Code: 58, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, synd= 1c001c0db, ipid= 1000103b30400, mcgstatus=0, mcgcap= 140, apicid= 0
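
No offline parse was captured for the code 58 run. As a convenience, the sketch below builds the equivalent "rasdaemon -p" command line from an mce_record line, using only the options listed in the --help output above. The helper name offline_cmd_from_record is hypothetical. The family and model defaults are taken from the "(19:90:1)" field in the dmesg output and are assumptions for any other system. The script only prints the command; it does not run it.

```python
import re
import shlex


def offline_cmd_from_record(line: str, family: str = "0x19", model: str = "0x90") -> str:
    """Build a 'rasdaemon -p' command from one mce_record log line."""
    # \b keeps 'status=' from matching inside 'mcgstatus='.
    bank = re.search(r"\(bank=(\d+)\)", line).group(1)
    status = re.search(r"\bstatus=\s*([0-9a-fA-F]+)", line).group(1)
    ipid = re.search(r"\bipid=\s*([0-9a-fA-F]+)", line).group(1)
    argv = [
        "rasdaemon", "-p",
        "--status", "0x" + status,
        "--ipid", "0x" + ipid,
        "--smca",
        "--family", family,
        "--model", model,
        "--bank", bank,
    ]
    return shlex.join(argv)


if __name__ == "__main__":
    record = (
        "mce_record: 2024-06-30 07:20:12 +0000 System Management Unit (bank=60), "
        "status= 82204000003a0000, Corrected error, no action required., "
        "mci=Processor_context_corrupt CECC, mca=A correctable error from a GFX Sub-IP. "
        "Ext Err Code: 58, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, "
        "synd= 1c001c0db, ipid= 1000103b30400, mcgstatus=0, mcgcap= 140, apicid= 0"
    )
    print(offline_cmd_from_record(record))
```

The printed command has the same shape as the offline parse commands used in the earlier examples, with the status value from the code 58 record.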