-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
rhel-8.6.0.z
-
None
-
Critical
-
sst_network_drivers
-
ssg_networking
-
None
-
False
-
Red Hat Enterprise Linux
-
None
-
-
x86_64
-
None
What were you trying to do that didn't work?
Attempted a hot reset of the firmware, and the reset failed. This issue did not occur on RHEL 8.10.
Please provide the package NVR for which bug is seen:
RHEL-8.6.0-updates-20231213.16
ethtool -i ens1f0
driver: mlx5_core
version: 4.18.0-372.82.1.rt7.241.el8_6.x
firmware-version: 16.35.3006 (MT_0000000080)
expansion-rom-version:
bus-info: 0000:17:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-372.82.1.rt7.241.el8_6.x86_64 root=/dev/mapper/rhel_dell-per75003-root ro intel_iommu=on ksdevice=bootif pci=realloc crashkernel=auto resume=/dev/mapper/rhel_dellper750-03-swap rd.lvm.lv=rhel_dell-per750-03/root rd.lvm.lv=rhel_dell-per750-03/swap console=ttyS0,115200n81
How reproducible: 100%
Steps to reproduce
- mstfwreset -y -d 0000:17:00.0 reset
Minimal reset level for device, 0000:17:00.0:3: Driver restart and PCI reset
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw -Done
-I- Stopping Driver -Done
-I- Resetting PCI -Done
-I- Starting Driver -Failed
-E- Failed to start driver! please start driver manually.
dmesg log
[ 61.487145] mlx5_core 0000:17:00.0: E-Switch: cleanup
[ 65.302109] mlx5_core 0000:17:00.1: E-Switch: cleanup
[ 81.287427] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 81.287431] {1}[Hardware Error]: event severity: recoverable
[ 81.287433] {1}[Hardware Error]: Error 0, type: fatal
[ 81.287435] {1}[Hardware Error]: section_type: PCIe error
[ 81.287436] {1}[Hardware Error]: port_type: 4, root port
[ 81.287437] {1}[Hardware Error]: version: 3.0
[ 81.287438] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 81.287441] {1}[Hardware Error]: device_id: 0000:16:04.0
[ 81.287442] {1}[Hardware Error]: slot: 1
[ 81.287443] {1}[Hardware Error]: secondary_bus: 0x17
[ 81.287444] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x347c
[ 81.287446] {1}[Hardware Error]: class_code: 000406
[ 81.287447] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 81.287448] {1}[Hardware Error]: aer_uncor_status: 0x00002000, aer_uncor_mask: 0x01310000
[ 81.287449] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 81.287450] {1}[Hardware Error]: TLP Header: ffffffff ffffffff ffffffff ffffffff
[ 81.287528] pcieport 0000:16:04.0: AER: aer_status: 0x00002000, aer_mask: 0x01310000
[ 81.287531] pcieport 0000:16:04.0: [13] FCP (First)
[ 81.287533] pcieport 0000:16:04.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 81.287535] pcieport 0000:16:04.0: AER: aer_uncor_severity: 0x044ef030
[ 81.287538] pci 0000:17:00.0: AER: can't recover (no error_detected callback)
[ 81.287539] pci 0000:17:00.1: AER: can't recover (no error_detected callback)
[ 82.304812] pcieport 0000:16:04.0: AER: Root Port link has been reset (0)
[ 82.304842] pcieport 0000:16:04.0: AER: device recovery failed
[ 82.371161] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 82.371195] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 82.382019] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 82.400715] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 82.400986] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 82.504163] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 82.504198] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 82.514694] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 82.533360] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 82.533615] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 82.636753] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 82.636789] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 82.647120] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 82.666190] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 82.666450] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 82.769570] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 82.769605] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 82.779849] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 82.798652] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 82.798925] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 82.902052] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 82.902087] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 82.912338] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 82.931177] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 82.931429] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 83.034554] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 83.034588] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.044819] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.063535] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.063786] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 83.166944] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 83.166978] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.177197] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.196038] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.196291] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 83.299418] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 83.299453] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.309611] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.328394] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.328652] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 83.431797] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 83.431832] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.441980] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.460737] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.461011] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 83.564115] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 83.564150] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.574281] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.592942] mlx5_core 0000:17:00.0: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.593214] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 83.696314] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 83.696349] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.706521] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.725349] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.725645] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 83.828770] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 83.828817] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.838951] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.857571] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.857881] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 83.960985] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 83.961020] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 83.971221] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 83.990114] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 83.990409] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.093495] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.093530] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.103681] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.122382] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.122673] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.225821] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.225856] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.235987] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.254519] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.254856] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.357940] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.357975] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.368098] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.386851] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.387146] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.490440] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.490475] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.500658] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.519242] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.519538] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.622661] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.622696] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.632863] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.651507] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.651837] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.754963] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.754998] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.765103] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.783628] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.783939] mlx5_core: probe of 0000:17:00.1 failed with error -5
[ 84.887047] mlx5_core 0000:17:00.1: firmware version: 16.35.3006
[ 84.887082] mlx5_core 0000:17:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 84.897259] mlx5_core 0000:17:00.1: mlx5_function_setup:1028:(pid 5): enable hca failed
[ 84.916227] mlx5_core 0000:17:00.1: probe_one:1499:(pid 5): mlx5_init_one failed with error code -5
[ 84.916523] mlx5_core: probe of 0000:17:00.1 failed with error -5
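The AER status words in the log above can be cross-checked by hand. The following is a minimal sketch, not part of the original report: `set_bits` is a hypothetical helper name, and it only lists which bit positions are set in a hex word. For `aer_uncor_status: 0x00002000` it reports bit 13, which matches the `[13] FCP (First)` line the kernel printed. The same bit also appears to be set in `aer_uncor_severity: 0x044ef030`, which would be consistent with the error being reported as fatal.

```shell
#!/bin/sh
# set_bits (hypothetical helper): print the positions of the bits set in a
# status word, lowest bit first, separated by spaces.
# Accepts 0x-prefixed hex such as the aer_uncor_status value from dmesg.
set_bits() {
    value=$(( $1 ))
    bit=0
    out=""
    while [ "$value" -gt 0 ]; do
        if [ $(( value & 1 )) -eq 1 ]; then
            out="${out:+$out }$bit"
        fi
        value=$(( value >> 1 ))
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}

set_bits 0x00002000   # aer_uncor_status from the log; prints 13
set_bits 0x01310000   # aer_uncor_mask from the log; prints 16 20 21 24
```

Bit 13 is not among the masked bits, so the error was reported rather than suppressed.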
I tried rolling back to the stock kernel, but the reset still fails.
[root@dell-per750-03 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-372.85.1.el8_6.x86_64 root=/dev/mapper/rhel_dell--per750--03-root ro intel_iommu=on ksdevice=bootif pci=realloc crashkernel=auto resume=/dev/mapper/rhel_dell--per750--03-swap rd.lvm.lv=rhel_dell-per750-03/root rd.lvm.lv=rhel_dell-per750-03/swap console=ttyS0,115200n81
[root@dell-per750-03 ~]# mstfwreset -y -d 0000:17:00.0 reset
Minimal reset level for device, 0000:17:00.0:3: Driver restart and PCI reset
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw -Done
-I- Stopping Driver -Done
-I- Resetting PCI -Done
-I- Starting Driver -Failed
-E- Failed to start driver! please start driver manually.
dmesg log
[ 55.961991] mlx5_core 0000:17:00.0: E-Switch: cleanup
[ 59.039966] mlx5_core 0000:17:00.1: E-Switch: cleanup
[ 75.006469] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 75.014731] {1}[Hardware Error]: event severity: recoverable
[ 75.020391] {1}[Hardware Error]: Error 0, type: fatal
[ 75.025530] {1}[Hardware Error]: section_type: PCIe error
[ 75.031103] {1}[Hardware Error]: port_type: 4, root port
[ 75.036587] {1}[Hardware Error]: version: 3.0
[ 75.041120] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 75.047299] {1}[Hardware Error]: device_id: 0000:16:04.0
[ 75.052785] {1}[Hardware Error]: slot: 1
[ 75.056886] {1}[Hardware Error]: secondary_bus: 0x17
[ 75.062024] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x347c
[ 75.068637] {1}[Hardware Error]: class_code: 000406
[ 75.073689] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 75.081427] {1}[Hardware Error]: aer_uncor_status: 0x00002000, aer_uncor_mask: 0x01310000
[ 75.089773] {1}[Hardware Error]: aer_uncor_severity: 0x044ef030
[ 75.095869] {1}[Hardware Error]: TLP Header: ffffffff ffffffff ffffffff ffffffff
[ 75.103456] pcieport 0000:16:04.0: AER: aer_status: 0x00002000, aer_mask: 0x01310000
[ 75.111208] pcieport 0000:16:04.0: [13] FCP (First)
[ 75.117999] pcieport 0000:16:04.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 75.126256] pcieport 0000:16:04.0: AER: aer_uncor_severity: 0x044ef030
[ 75.132782] pci 0000:17:00.0: AER: can't recover (no error_detected callback)
[ 75.139914] pci 0000:17:00.1: AER: can't recover (no error_detected callback)
[ 76.218664] pcieport 0000:16:04.0: AER: Root Port link has been reset (0)
[ 76.225473] pcieport 0000:16:04.0: AER: device recovery failed
[ 76.276184] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 76.282231] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 76.300373] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 949): enable hca failed
[ 76.326060] mlx5_core 0000:17:00.0: probe_one:1499:(pid 949): mlx5_init_one failed with error code -5
[ 76.335552] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 76.444169] mlx5_core 0000:17:00.0: firmware version: 16.35.3006
[ 76.450208] mlx5_core 0000:17:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 76.467889] mlx5_core 0000:17:00.0: mlx5_function_setup:1028:(pid 949): enable hca failed
[ 76.493764] mlx5_core 0000:17:00.0: probe_one:1499:(pid 949): mlx5_init_one failed with error code -5
[ 76.503223] mlx5_core: probe of 0000:17:00.0 failed with error -5
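To compare runs across kernels, the repeated probe failures can be condensed into one line per PCI function. This is a rough sketch under stated assumptions: `count_probe_failures` is a hypothetical helper name, and it expects dmesg text on stdin with one kernel message per line. The sample input is the two `probe of ... failed` lines from the stock-kernel log above. Error `-5` corresponds to `EIO` on Linux.

```shell
#!/bin/sh
# count_probe_failures (hypothetical helper): read dmesg text on stdin and
# print "<PCI address> error <code> x<count>" for each mlx5_core probe failure.
count_probe_failures() {
    sed -n 's/.*mlx5_core: probe of \([0-9a-f:.]*\) failed with error \(-[0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, "error", $3, "x" $1 }'
}

# Sample input taken from the log above; on a live system this could be
# "dmesg | count_probe_failures" instead.
count_probe_failures <<'EOF'
[ 76.335552] mlx5_core: probe of 0000:17:00.0 failed with error -5
[ 76.503223] mlx5_core: probe of 0000:17:00.0 failed with error -5
EOF
# prints: 0000:17:00.0 error -5 x2
```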
Expected results
The reset succeeds.
Actual results
The reset fails: mstfwreset cannot start the driver after the PCI reset.