Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-27227

Dynamic irq load balancing issues (4.15)


    • Important
    • No
    • CNF Compute Sprint 249
    • 1
    • False
    • Hide



      This is a clone of issue OCPBUGS-25699. The following is the description of the original issue:

      Description of problem:

      If GloballyDisableIrqLoadBalancing in disabled in the performance profile then irqs should be balanced across all cpus minus the cpus that are explicitly removed by crio via the pod annotation irq-load-balancing.crio.io: "disable"
      We have found a number of issues with this: 
      1) The script  clear-irqbalance-banned-cpus.sh is setting an empty value for IRQBALANCE_BANNED_CPUS in /etc/sysconfig/irqbalance. If no value is provided, irqbalance will calculate a default. The default will exclude all isolated and nohz_full cpus from the mask resulting in the irq’s being balanced over the reserved cpus only, breaking the user intent.
      If a guaranteed pod with the  irq-load-balancing.crio.io: "disable” annotation gets launched then irqbalance will heal the system but if one never does then all irqs will be affined to the reserved cores. 
      This script needs to set the banned mask to 0’s on startup.
      2) The more serious issue, the scheduler plugin in tuned will attempt to affine all irqs to the non-isolated cores. Isolated here means non-reserved, not truly isolated cores. This is directly at odds with the user intent. So now we have tuned fighting with crio/irqbalance both trying to do different things. 
      - If a pod get’s launched with the annotation after tuned has started, runtime or after a reboot - ok 
      - On a reboot if tuned recovers after the guaranteed pod has been launched - broken
      - If tuned restarts at runtime for any reason - broken
      3) Lastly the crio restore of the irqbalance mask needs to be removed. Disabling this should be part of the crio conf that is installed by the NTO.

      Version-Release number of selected component (if applicable):

         4.14 and likely earlier

      How reproducible:

          See description

      Steps to Reproduce:

          1.See description 

      Actual results:


      Expected results:


      Additional info:



            titzhak Talor Itzhak
            openshift-crt-jira-prow OpenShift Prow Bot
            0 Vote for this issue
            10 Start watching this issue