-
Bug
-
Resolution: Unresolved
-
4.12, 4.14, 4.18
-
Quality / Stability / Reliability
Description of problem:
The megasas IRQs are not affined to the reserved cores after the performance profile has been applied; they remain on isolated CPUs.
In this case the reserved cores are 94-95,190-191.
[root@worker-0 ~]# ls /proc/irq/153
affinity_hint effective_affinity effective_affinity_list megasas0-msix80 node smp_affinity smp_affinity_list spurious
[root@worker-0 ~]# cat /proc/irq/153/smp_affinity_list
72
Attempting to set the affinity manually fails with the following error:
[root@worker-0 ~]# echo "88-95,184-191" > /proc/irq/153/smp_affinity_list
-bash: echo: write error: Input/output error
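The check behind this report can be made explicit by comparing an IRQ's affinity list against the reserved set. The following is a minimal sketch in POSIX sh; the helper names expand_cpulist and affinity_within are hypothetical and are not part of OpenShift or the kernel tooling. The Input/output error itself is consistent with the kernel refusing userspace affinity changes for interrupts whose affinity it manages, but that is an assumption about this driver's setup, not something confirmed in this report.

```shell
# Expand a kernel CPU list such as "94-95,190-191" into space-separated CPUs.
# The function body runs in a subshell so IFS and variables do not leak.
expand_cpulist() (
    IFS=','
    out=""
    for part in $1; do
        lo=${part%-*}      # "94-95" -> 94, "72" -> 72
        hi=${part#*-}      # "94-95" -> 95, "72" -> 72
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            out="${out:+$out }$cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "$out"
)

# Succeed only if every CPU in list $1 is also in list $2.
affinity_within() (
    allowed=" $(expand_cpulist "$2") "
    for cpu in $(expand_cpulist "$1"); do
        case "$allowed" in
            *" $cpu "*) ;;
            *) exit 1 ;;
        esac
    done
    exit 0
)
```

For example, affinity_within "$(cat /proc/irq/153/smp_affinity_list)" "94-95,190-191" would fail for the value 72 shown above. Note also that the list written in the manual attempt, 88-95,184-191, is wider than the reserved set 94-95,190-191.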
Version-Release number of selected component (if applicable):
Probably all versions of OpenShift; so far reproduced on OCP 4.12, 4.14, and 4.18.
How reproducible:
100%
Steps to Reproduce:
1. Apply the performance profile.
2. After it has been applied, query the affinity of the megasas IRQs.
3. Try to change smp_affinity_list as described above.
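Step 2 can be scripted. The sketch below relies on the layout visible in the ls output above, where each IRQ directory contains an entry named after its handler (megasas0-msix80 for IRQ 153). The function name list_irq_affinity and its optional second argument (an alternative proc root, useful for testing) are assumptions made for illustration.

```shell
# Print "IRQ HANDLER SMP_AFFINITY EFFECTIVE_AFFINITY" for every IRQ that has
# a handler whose name starts with the prefix given in $1 (e.g. "megasas").
# $2 optionally overrides the proc root; it defaults to /proc.
list_irq_affinity() (
    proc_root=${2:-/proc}
    for dir in "$proc_root"/irq/[0-9]*; do
        [ -d "$dir" ] || continue
        # Handlers appear as entries named after them inside the IRQ directory.
        for h in "$dir"/"$1"*; do
            [ -e "$h" ] || continue
            echo "${dir##*/} ${h##*/} $(cat "$dir/smp_affinity_list") $(cat "$dir/effective_affinity_list")"
        done
    done
)
```

Running list_irq_affinity megasas on the affected node should print one line per megasas vector, which makes it easy to spot vectors that sit outside 94-95,190-191.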
Actual results:
Megasas irqs are not affined to the reserved cores.
Expected results:
Megasas irqs are affined to the reserved cores.
Additional info:
Validated on the following hardware:
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7615
Processor Information
Socket Designation: CPU1
Type: Central Processor
Family: Zen
Manufacturer: AMD
ID: 11 0F A1 00 FF FB 8B 17
Signature: Family 25, Model 17, Stepping 1
Version: AMD EPYC 9654P 96-Core Processor
Core Count: 96
Core Enabled: 96
Thread Count: 192
[core@worker-0 ~]$ sudo lspci -v | less
41:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx
DeviceName: SL1 RAID
Subsystem: Dell PERC H755N Front
Flags: bus master, fast devsel, latency 0, IRQ 72, NUMA node 0, IOMMU group 17
Memory at 90000000 (64-bit, prefetchable) [size=1M]
Memory at 90100000 (64-bit, prefetchable) [size=1M]
Memory at a4000000 (32-bit, non-prefetchable) [size=1M]
I/O ports at 4000 [size=256]
Expansion ROM at <ignored> [disabled]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=128 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] Power Budgeting <?>
Capabilities: [158] Alternative Routing-ID Interpretation (ARI)
Capabilities: [168] Secondary PCI Express
Capabilities: [188] Physical Layer 16.0 GT/s <?>
Capabilities: [1b0] Lane Margining at the Receiver <?>
Capabilities: [248] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [348] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [380] Data Link Feature <?>
Kernel driver in use: megaraid_sas
Kernel modules: megaraid_sas
The output below shows the megasas IRQs landing on a large number of the isolated CPUs:
$ oc get performanceprofile -o yaml | head -30
apiVersion: v1
items:
- apiVersion: performance.openshift.io/v2
  kind: PerformanceProfile
  metadata:
    name: blueprint-profile
  spec:
    additionalKernelArgs:
    - nohz_full=0-93,96-189
    cpu:
      isolated: 0-93,96-189
      reserved: 94-95,190-191
$ CPUMAX=$(grep processor /proc/cpuinfo | tail -n 1 | grep -Eo '[0-9]+$')
$ echo "=== NAME of IRQs for every CPU ==="
$ for C in $(seq 0 "$CPUMAX") ; do
    echo -n "CPU${C}:"
    # IRQ numbers whose effective affinity is exactly this single CPU
    IRQS=$(grep -H "${C}" /proc/irq/*/effective_affinity_list | grep ":${C}\$" | cut -f 4 -d '/')
    for I in $IRQS ; do
      # The last column of /proc/interrupts is the handler name
      IRQNAME=$(grep " ${I}:" /proc/interrupts | awk '{print $NF}')
      echo -n " ${IRQNAME}"
    done
    echo
  done
=== NAME of IRQs for every CPU ===
CPU0: timer
CPU1:
CPU2: AMD-Vi
CPU3: AMD-Vi
CPU4: AMD-Vi
CPU5: AMD-Vi
...
CPU71:
CPU72: megasas0-msix80
CPU73: megasas0-msix81
CPU74: megasas0-msix82
CPU75: megasas0-msix83
CPU76: megasas0-msix84
CPU77: megasas0-msix85
CPU78: megasas0-msix86
CPU79: megasas0-msix87
CPU80: megasas0-msix88
CPU81: megasas0-msix89
CPU82: megasas0-msix90
CPU83: megasas0-msix91
CPU84: megasas0-msix92
CPU85: megasas0-msix93
CPU86: megasas0-msix94
CPU87: megasas0-msix95
CPU88: megasas0-msix96 mlx5_comp1@pci:0000:81:00.0
...
CPU96: megasas0-msix8
CPU97: megasas0-msix9
CPU98: megasas0-msix10
CPU99: megasas0-msix11
CPU100: megasas0-msix12
CPU101: megasas0-msix13
CPU102: megasas0-msix14
CPU103: megasas0-msix15
CPU104: megasas0-msix16
CPU105: megasas0-msix17
CPU106: megasas0-msix18
CPU107: megasas0-msix19
CPU108: megasas0-msix20
CPU109: megasas0-msix21
CPU110: megasas0-msix22
CPU111: megasas0-msix23
CPU112: megasas0-msix24
CPU113: megasas0-msix25
CPU114: megasas0-msix26
CPU115: megasas0-msix27
CPU116: megasas0-msix28
CPU117: megasas0-msix29
CPU118: megasas0-msix30
CPU119: megasas0-msix31
CPU120: megasas0-msix32
CPU121: megasas0-msix33
CPU122: megasas0-msix34
CPU123: megasas0-msix35
CPU124: megasas0-msix36
CPU125: megasas0-msix37
CPU126: megasas0-msix38
CPU127: megasas0-msix39
CPU128: megasas0-msix40
CPU129: megasas0-msix41
CPU130: megasas0-msix42
CPU131: megasas0-msix43
CPU132: megasas0-msix44
CPU133: megasas0-msix45
CPU134: megasas0-msix46
CPU135: megasas0-msix47
CPU136: megasas0-msix48
CPU137: megasas0-msix49
CPU138: megasas0-msix50
CPU139: megasas0-msix51
CPU140: megasas0-msix52
CPU141: megasas0-msix53
CPU142: megasas0-msix54
CPU143: megasas0-msix55
CPU144: megasas0-msix56
CPU145: megasas0-msix57
CPU146: megasas0-msix58
CPU147: megasas0-msix59
CPU148: megasas0-msix60
CPU149: megasas0-msix61
CPU150: megasas0-msix62
CPU151: megasas0-msix63
CPU152: megasas0-msix64
CPU153: megasas0-msix65
CPU154: megasas0-msix66
CPU155: megasas0-msix67
CPU156: megasas0-msix68
CPU157: megasas0-msix69
CPU158: megasas0-msix70
CPU159: megasas0-msix71
CPU160: megasas0-msix72
CPU161: megasas0-msix73
CPU162: megasas0-msix74
CPU163: megasas0-msix75
CPU164: megasas0-msix76
CPU165: megasas0-msix77
CPU166: megasas0-msix78
CPU167: megasas0-msix79
CPU168: megasas0-msix104
CPU169: megasas0-msix105
CPU170: megasas0-msix106
CPU171: megasas0-msix107
CPU172: megasas0-msix108
CPU173: megasas0-msix109
CPU174: megasas0-msix110
CPU175: megasas0-msix111
CPU176: megasas0-msix112
CPU177: megasas0-msix113
CPU178: megasas0-msix114
CPU179: megasas0-msix115
CPU180: megasas0-msix116
CPU181: megasas0-msix117
CPU182: megasas0-msix118
CPU183: megasas0-msix119
CPU184: megasas0-msix120
We also tried the workaround of setting smp_affinity_enable=0, described at https://www.suse.com/support/kb/doc/?id=000021663, but obtained the same results:
$ oc get performanceprofile blueprint-profile -o json | jq .spec.additionalKernelArgs
[
  "nohz_full=0-93,96-189",
  "smp_affinity_enable=0"
]
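It may be worth confirming that the setting actually reached the driver. Module parameters passed on the kernel command line are normally written with the module name as a prefix (for example megaraid_sas.smp_affinity_enable=0), so an unprefixed argument might not be picked up. The sketch below reads the value back from sysfs; it assumes the driver exposes the parameter under /sys/module/megaraid_sas/parameters/, and the function name megasas_param and its optional second argument (an alternative sysfs root) are hypothetical.

```shell
# Print the current value of a megaraid_sas module parameter, or
# "unavailable" if sysfs does not expose it.
# $1 is the parameter name; $2 optionally overrides the sysfs root (/sys).
megasas_param() (
    sys_root=${2:-/sys}
    f="$sys_root/module/megaraid_sas/parameters/$1"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
    fi
)
```

On the node, megasas_param smp_affinity_enable can be compared with the output of grep -o '[^ ]*smp_affinity_enable=[^ ]*' /proc/cmdline to see whether the kernel argument and the driver's view agree.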
We found a similar bug reported against OCP 4.6, https://bugzilla.redhat.com/show_bug.cgi?id=1908944, and we are wondering whether this could also be hardware related.
4.12 SOSReport can be found here: sosreport-worker-0-2025-06-17-oveuywj.tar.xz