Description of problem:
A PF reset happened on a Mellanox ConnectX-6 MT2892, causing VFs to detach from workload pods. The same PF reset can be seen on all the nodes.
Version-Release number of selected component (if applicable):
SR-IOV operator on OCP v4.16
How reproducible:
Unknown. The customer is still trying to find the reproducer steps.
Steps to Reproduce:
N/A
Actual results:
The PF reset brings the VFs down, and the following can be seen in dmesg:
$ cat sos_commands/kernel/dmesg
...
[2565093.192508] mlx5_core 0000:2a:0b.3 ens1f1v26: renamed from eth0
[2565094.742382] mlx5_core 0000:2a:0b.3: enabling device (0000 -> 0002)
[2565094.742508] mlx5_core 0000:2a:0b.3: firmware version: 22.35.3006
[2565095.080216] mlx5_core 0000:2a:0b.3: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[2565095.082999] irq 1730: Affinity broken due to vector space exhaustion.
[2565095.083029] irq 1731: Affinity broken due to vector space exhaustion.
[2565095.083049] irq 1732: Affinity broken due to vector space exhaustion.
[2565095.272172] mlx5_core 0000:2a:0b.3: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
[2565095.285954] mlx5_core 0000:2a:0b.3 ens1f1v26: renamed from eth0
[2565096.844456] mlx5_core 0000:2a:0b.4: enabling device (0000 -> 0002)
[2565096.844588] mlx5_core 0000:2a:0b.4: firmware version: 22.35.3006
[2565097.185187] mlx5_core 0000:2a:0b.4: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[2565097.188027] irq 1742: Affinity broken due to vector space exhaustion.
[2565097.188062] irq 1743: Affinity broken due to vector space exhaustion.
[2565097.188088] irq 1744: Affinity broken due to vector space exhaustion.
Expected results:
A PF reset should not be triggered unless it is done manually.
Additional info:
1) The timestamp is Dec 11 10:33.
2) When ZTE manually triggered a reset of the PF using the mlxfwreset utility, the same symptoms were seen, which confirms that an unexpected PF reset had occurred.
3) As of now, ZTE is trying to find the reproducer steps and to share the exact logs from the above-mentioned time.
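While the exact logs are pending, a small helper may make it easier to locate the reset window in a full dmesg capture. The sketch below is not part of the customer's tooling. It assumes that a burst of mlx5_core "enabling device" lines across many VF PCI addresses marks the VFs being re-probed after the PF reset, and it groups those lines by PCI address. The function name `find_reprobes` and the sample list are illustrative only.

```python
import re

# Matches dmesg lines such as:
#   [2565094.742382] mlx5_core 0000:2a:0b.3: enabling device (0000 -> 0002)
# Assumption: these lines indicate a VF being (re-)probed by mlx5_core.
PROBE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+enabling device"
)


def find_reprobes(lines):
    """Return {pci_address: [timestamps]} for each 'enabling device' line."""
    events = {}
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            events.setdefault(match.group("bdf"), []).append(float(match.group("ts")))
    return events


# A few lines copied from the dmesg excerpt in this report.
sample = [
    "[2565093.192508] mlx5_core 0000:2a:0b.3 ens1f1v26: renamed from eth0",
    "[2565094.742382] mlx5_core 0000:2a:0b.3: enabling device (0000 -> 0002)",
    "[2565095.082999] irq 1730: Affinity broken due to vector space exhaustion.",
    "[2565096.844456] mlx5_core 0000:2a:0b.4: enabling device (0000 -> 0002)",
]

if __name__ == "__main__":
    for bdf, stamps in find_reprobes(sample).items():
        print(bdf, stamps)
```

To run it against the sosreport, replace `sample` with the lines of `sos_commands/kernel/dmesg`. Many distinct addresses re-probed within a few seconds of each other would be consistent with the symptom described above, though it does not by itself identify what triggered the reset.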