-
Bug
-
Resolution: Unresolved
4.12, 4.14, 4.18
Quality / Stability / Reliability
Description of problem:
Not all megasas IRQs are affined to the reserved cores after the performance profile has been applied. In this case the reserved cores are 94-95,190-191.
[root@worker-0 ~]# ls /proc/irq/153
affinity_hint effective_affinity effective_affinity_list megasas0-msix80 node smp_affinity smp_affinity_list spurious
[root@worker-0 ~]# cat /proc/irq/153/smp_affinity_list
72
Trying to set the affinity manually fails with the following error:
[root@worker-0 ~]# echo "88-95,184-191" > /proc/irq/153/smp_affinity_list
-bash: echo: write error: Input/output error
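The check above covers a single IRQ (153). As a minimal sketch, the same query can be repeated for every megasas vector. It assumes that each /proc/irq/<n>/ directory contains an entry named after the interrupt (as the `ls` output for IRQ 153 suggests). The function name `list_megasas_affinity` and the `IRQ_ROOT` variable are hypothetical helpers introduced here, not part of any tool. For context, an Input/output error on write is what the kernel returns when user space is not allowed to change an interrupt's affinity, which would be consistent with these vectors having kernel-managed affinity. That is an interpretation, not something the report confirms.

```shell
#!/usr/bin/env bash
# Sketch: print configured and effective affinity for every megasas IRQ.
# Assumption: /proc/irq/<n>/ holds an entry named megasas*, as seen for IRQ 153.

list_megasas_affinity() {
    # $1: IRQ directory root (defaults to /proc/irq; overridable for testing)
    local root="${1:-/proc/irq}" entry dir
    for entry in "$root"/*/megasas*; do
        [ -e "$entry" ] || continue          # glob matched nothing
        dir="${entry%/*}"                    # e.g. /proc/irq/153
        printf '%s %s smp=%s effective=%s\n' \
            "${dir##*/}" "${entry##*/}" \
            "$(cat "$dir/smp_affinity_list" 2>/dev/null)" \
            "$(cat "$dir/effective_affinity_list" 2>/dev/null)"
    done
}

# IRQ_ROOT is a hypothetical override; by default the live /proc/irq is read.
list_megasas_affinity "${IRQ_ROOT:-/proc/irq}"
```

Run as root on the affected worker, this prints one line per megasas vector, which makes it easy to compare the whole set against the reserved cores 94-95,190-191.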
Version-Release number of selected component (if applicable):
Probably all versions of OpenShift; so far reproduced on OCP 4.12, 4.14, and 4.18.
How reproducible:
100%
Steps to Reproduce:
1. Apply the performance profile.
2. After it has been applied, query the affinity of the megasas IRQs.
3. Try to change smp_affinity_list as described above.
Actual results:
Megasas IRQs are not affined to the reserved cores.
Expected results:
Megasas IRQs are affined to the reserved cores.
Additional info:
Validated in the following hardware:
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge R7615
Processor Information
    Socket Designation: CPU1
    Type: Central Processor
    Family: Zen
    Manufacturer: AMD
    ID: 11 0F A1 00 FF FB 8B 17
    Signature: Family 25, Model 17, Stepping 1
    Version: AMD EPYC 9654P 96-Core Processor
    Core Count: 96
    Core Enabled: 96
    Thread Count: 192
[core@worker-0 ~]$ sudo lspci -v | less
41:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx
    DeviceName: SL1 RAID
    Subsystem: Dell PERC H755N Front
    Flags: bus master, fast devsel, latency 0, IRQ 72, NUMA node 0, IOMMU group 17
    Memory at 90000000 (64-bit, prefetchable) [size=1M]
    Memory at 90100000 (64-bit, prefetchable) [size=1M]
    Memory at a4000000 (32-bit, non-prefetchable) [size=1M]
    I/O ports at 4000 [size=256]
    Expansion ROM at <ignored> [disabled]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
    Capabilities: [70] Express Endpoint, MSI 00
    Capabilities: [b0] MSI-X: Enable+ Count=128 Masked-
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [148] Power Budgeting <?>
    Capabilities: [158] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [168] Secondary PCI Express
    Capabilities: [188] Physical Layer 16.0 GT/s <?>
    Capabilities: [1b0] Lane Margining at the Receiver <?>
    Capabilities: [248] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
    Capabilities: [348] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
    Capabilities: [380] Data Link Feature <?>
    Kernel driver in use: megaraid_sas
    Kernel modules: megaraid_sas
The following output shows that the megasas IRQs are spread across many of the isolated CPUs:
$ oc get performanceprofile -o yaml | head -30
apiVersion: v1
items:
- apiVersion: performance.openshift.io/v2
  kind: PerformanceProfile
  metadata:
    name: blueprint-profile
  spec:
    additionalKernelArgs:
    - nohz_full=0-93,96-189
    cpu:
      isolated: 0-93,96-189
      reserved: 94-95,190-191
$ CPUMAX=`cat /proc/cpuinfo | grep processor | tail -n 1 | egrep -o [0-9]*$`
$ echo === NAME of IRQs for every CPU ===
$ for C in `seq 0 $CPUMAX` ; do
    echo -n CPU${C}:
    IRQS=`grep -H ${C} /proc/irq/*/effective_affinity_list | grep :${C}$ | cut -f 4 -d '/'`
    for I in $IRQS ; do
      IRQNAME=`cat /proc/interrupts | grep \ ${I}\: | awk '{print $(NF)}'`
      echo -n " "${IRQNAME}
    done
    echo
  done
=== NAME of IRQs for every CPU ===
CPU0: timer
CPU1:
CPU2: AMD-Vi
CPU3: AMD-Vi
CPU4: AMD-Vi
CPU5: AMD-Vi
...
CPU71:
CPU72: megasas0-msix80
CPU73: megasas0-msix81
CPU74: megasas0-msix82
CPU75: megasas0-msix83
CPU76: megasas0-msix84
CPU77: megasas0-msix85
CPU78: megasas0-msix86
CPU79: megasas0-msix87
CPU80: megasas0-msix88
CPU81: megasas0-msix89
CPU82: megasas0-msix90
CPU83: megasas0-msix91
CPU84: megasas0-msix92
CPU85: megasas0-msix93
CPU86: megasas0-msix94
CPU87: megasas0-msix95
CPU88: megasas0-msix96 mlx5_comp1@pci:0000:81:00.0
...
CPU96: megasas0-msix8
CPU97: megasas0-msix9
CPU98: megasas0-msix10
CPU99: megasas0-msix11
CPU100: megasas0-msix12
CPU101: megasas0-msix13
CPU102: megasas0-msix14
CPU103: megasas0-msix15
CPU104: megasas0-msix16
CPU105: megasas0-msix17
CPU106: megasas0-msix18
CPU107: megasas0-msix19
CPU108: megasas0-msix20
CPU109: megasas0-msix21
CPU110: megasas0-msix22
CPU111: megasas0-msix23
CPU112: megasas0-msix24
CPU113: megasas0-msix25
CPU114: megasas0-msix26
CPU115: megasas0-msix27
CPU116: megasas0-msix28
CPU117: megasas0-msix29
CPU118: megasas0-msix30
CPU119: megasas0-msix31
CPU120: megasas0-msix32
CPU121: megasas0-msix33
CPU122: megasas0-msix34
CPU123: megasas0-msix35
CPU124: megasas0-msix36
CPU125: megasas0-msix37
CPU126: megasas0-msix38
CPU127: megasas0-msix39
CPU128: megasas0-msix40
CPU129: megasas0-msix41
CPU130: megasas0-msix42
CPU131: megasas0-msix43
CPU132: megasas0-msix44
CPU133: megasas0-msix45
CPU134: megasas0-msix46
CPU135: megasas0-msix47
CPU136: megasas0-msix48
CPU137: megasas0-msix49
CPU138: megasas0-msix50
CPU139: megasas0-msix51
CPU140: megasas0-msix52
CPU141: megasas0-msix53
CPU142: megasas0-msix54
CPU143: megasas0-msix55
CPU144: megasas0-msix56
CPU145: megasas0-msix57
CPU146: megasas0-msix58
CPU147: megasas0-msix59
CPU148: megasas0-msix60
CPU149: megasas0-msix61
CPU150: megasas0-msix62
CPU151: megasas0-msix63
CPU152: megasas0-msix64
CPU153: megasas0-msix65
CPU154: megasas0-msix66
CPU155: megasas0-msix67
CPU156: megasas0-msix68
CPU157: megasas0-msix69
CPU158: megasas0-msix70
CPU159: megasas0-msix71
CPU160: megasas0-msix72
CPU161: megasas0-msix73
CPU162: megasas0-msix74
CPU163: megasas0-msix75
CPU164: megasas0-msix76
CPU165: megasas0-msix77
CPU166: megasas0-msix78
CPU167: megasas0-msix79
CPU168: megasas0-msix104
CPU169: megasas0-msix105
CPU170: megasas0-msix106
CPU171: megasas0-msix107
CPU172: megasas0-msix108
CPU173: megasas0-msix109
CPU174: megasas0-msix110
CPU175: megasas0-msix111
CPU176: megasas0-msix112
CPU177: megasas0-msix113
CPU178: megasas0-msix114
CPU179: megasas0-msix115
CPU180: megasas0-msix116
CPU181: megasas0-msix117
CPU182: megasas0-msix118
CPU183: megasas0-msix119
CPU184: megasas0-msix120
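The per-CPU listing above can be narrowed to just the offending vectors. The following is a rough sketch that reports only megasas IRQs whose effective affinity falls inside the isolated set from the performance profile (0-93,96-189). The helper names `cpu_in_list` and `megasas_on_isolated` are hypothetical. The sketch also assumes each vector's effective_affinity_list holds a single CPU number, as in the output above, and skips anything else.

```shell
#!/usr/bin/env bash
# Sketch: report megasas IRQs whose effective affinity is an isolated CPU.
# Assumption: effective_affinity_list contains a single CPU number per vector.

# Return 0 if CPU $1 is inside the kernel-style CPU list $2 (e.g. "0-93,96-189").
cpu_in_list() {
    local cpu="$1" list="$2" range lo hi
    local IFS=','
    for range in $list; do
        lo="${range%-*}"    # "0-93" -> 0, "72" -> 72
        hi="${range#*-}"    # "0-93" -> 93, "72" -> 72
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

# $1: isolated CPU list, $2: IRQ directory root (defaults to /proc/irq).
megasas_on_isolated() {
    local isolated="$1" root="${2:-/proc/irq}" entry dir cpu
    for entry in "$root"/*/megasas*; do
        [ -e "$entry" ] || continue
        dir="${entry%/*}"
        cpu="$(cat "$dir/effective_affinity_list" 2>/dev/null)"
        case "$cpu" in
            ''|*[!0-9]*) continue ;;   # skip empty or multi-CPU values
        esac
        if cpu_in_list "$cpu" "$isolated"; then
            printf 'IRQ %s (%s) on isolated CPU %s\n' \
                "${dir##*/}" "${entry##*/}" "$cpu"
        fi
    done
}

# Isolated set taken from the performance profile in this report.
megasas_on_isolated "0-93,96-189"
```

An empty result would mean that no megasas vector is effectively routed to an isolated CPU; on the system described here, it would be expected to list most of the vectors.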
We also tried the workaround of setting smp_affinity_enable=0 (https://www.suse.com/support/kb/doc/?id=000021663), but obtained the same results:
$ oc get performanceprofile blueprint-profile -o json | jq .spec.additionalKernelArgs
[
  "nohz_full=0-93,96-189",
  "smp_affinity_enable=0"
]
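To rule out that the workaround simply never reached the driver, the value the loaded module actually uses can be read back on the node. The sketch below assumes the parameter is exposed read-only under /sys/module/megaraid_sas/parameters/; the helper name `read_module_param` is hypothetical. Note that module parameters given on the kernel command line are generally only applied when prefixed with the module name (for example `megaraid_sas.smp_affinity_enable=0`), so an unprefixed `smp_affinity_enable=0` may have had no effect. This has not been verified on the affected system.

```shell
#!/usr/bin/env bash
# Sketch: read back a module parameter to confirm a kernel argument took effect.
# Assumption: the parameter is visible under <sys_root>/<module>/parameters/.

read_module_param() {
    # $1: module name, $2: parameter name, $3: sysfs module root (for testing)
    local module="$1" param="$2" sys_root="${3:-/sys/module}"
    local file="$sys_root/$module/parameters/$param"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"
        return 1
    fi
}

# Show the value in use by the driver, then how it was passed at boot (if at all).
read_module_param megaraid_sas smp_affinity_enable || true
grep -o '[^ ]*smp_affinity_enable=[^ ]*' /proc/cmdline 2>/dev/null || true
```

If the first command still prints 1 after the profile change, the driver did not pick up the setting, and the test with the workaround would need to be repeated before drawing conclusions from it.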
We found a similar bug from OCP 4.6, https://bugzilla.redhat.com/show_bug.cgi?id=1908944, and we are wondering whether this could also be hardware related.
4.12 SOSReport can be found here: sosreport-worker-0-2025-06-17-oveuywj.tar.xz