OCPBUGS-27227

Dynamic irq load balancing issues (4.15)


    • Important
    • No
    • CNF Compute Sprint 249
    • 1
    • False

      This is a clone of issue OCPBUGS-25699. The following is the description of the original issue:

      Description of problem:

      If GloballyDisableIrqLoadBalancing is disabled in the performance profile, then irqs should be balanced across all cpus minus the cpus that crio explicitly removes via the pod annotation irq-load-balancing.crio.io: "disable".
      
      We have found a number of issues with this: 
      
      1) The script clear-irqbalance-banned-cpus.sh sets an empty value for IRQBALANCE_BANNED_CPUS in /etc/sysconfig/irqbalance. If no value is provided, irqbalance calculates a default. The default excludes all isolated and nohz_full cpus from the mask, so the irqs are balanced over the reserved cpus only, breaking the user intent.
      If a guaranteed pod with the irq-load-balancing.crio.io: "disable" annotation gets launched, irqbalance will heal the system, but if one never does, all irqs stay affined to the reserved cores.
      This script needs to set the banned mask to all zeros on startup.
      
      2) The more serious issue: the scheduler plugin in tuned will attempt to affine all irqs to the non-isolated cores. "Isolated" here means non-reserved, not truly isolated cores. This is directly at odds with the user intent, so tuned ends up fighting with crio/irqbalance, each trying to do something different.
      
      Scenarios
      - If a pod gets launched with the annotation after tuned has started, whether at runtime or after a reboot - ok
      - On a reboot if tuned recovers after the guaranteed pod has been launched - broken
      - If tuned restarts at runtime for any reason - broken
      
      3) Lastly, the crio restore of the irqbalance mask needs to be removed. Disabling this restore should be part of the crio conf that is installed by the NTO.
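To make issue 1 concrete, here is a minimal Python sketch of how the set of cpus available for irq balancing differs between an empty IRQBALANCE_BANNED_CPUS and an explicit all-zero mask. It only models the behavior described in this issue and is not irqbalance's actual code. The helper names (mask_to_cpus, zero_mask, irq_eligible_cpus), the 8-cpu layout, and the comma-separated 32-bit hex group format are assumptions made for illustration.

```python
from typing import Optional, Set


def mask_to_cpus(mask: str) -> Set[int]:
    """Decode a hex cpu mask (commas between 32-bit groups allowed) into cpu ids."""
    value = int(mask.replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


def zero_mask(num_cpus: int) -> str:
    """Build an all-zero mask wide enough for num_cpus (one 8-digit group per 32 cpus)."""
    groups = max(1, (num_cpus + 31) // 32)
    return ",".join(["00000000"] * groups)


def irq_eligible_cpus(
    num_cpus: int,
    banned_value: Optional[str],
    isolated: Set[int],
    nohz_full: Set[int],
) -> Set[int]:
    """Return the cpus that irqs may be balanced over (hypothetical model)."""
    if not banned_value:
        # Empty or unset value: per the issue, a default is calculated that
        # bans every isolated and nohz_full cpu.
        banned = isolated | nohz_full
    else:
        # Explicit mask: only the cpus whose bits are set are banned.
        banned = mask_to_cpus(banned_value)
    return set(range(num_cpus)) - banned


if __name__ == "__main__":
    # Assumed layout: cpus 0-1 reserved, cpus 2-7 isolated and nohz_full.
    isolated = set(range(2, 8))
    print("empty value:", sorted(irq_eligible_cpus(8, "", isolated, isolated)))
    print("zero mask:  ", sorted(irq_eligible_cpus(8, zero_mask(8), isolated, isolated)))
```

With the empty value, only the reserved cpus remain eligible in this model; with the explicit zero mask, every cpu stays eligible until crio bans the cpus of an annotated pod.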

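The three scenarios under issue 2 can be read as an ordering problem: whichever component writes irq affinity last determines the final state. The Python sketch below illustrates that reading under a deliberately simplified "last writer wins" assumption. The event names, the 8-cpu layout, and the function names are hypothetical and do not correspond to real tuned or crio interfaces.

```python
from typing import List, Set

ALL_CPUS: Set[int] = set(range(8))  # assumed 8-cpu node
RESERVED: Set[int] = {0, 1}         # assumed reserved cpus


def tuned_apply() -> Set[int]:
    """Model of tuned's scheduler plugin as described: affine irqs to reserved cpus."""
    return set(RESERVED)


def crio_irqbalance_apply(pod_cpus: Set[int]) -> Set[int]:
    """Model of crio plus irqbalance: balance over all cpus except the annotated pod's."""
    return ALL_CPUS - pod_cpus


def final_irq_cpus(events: List[str], pod_cpus: Set[int]) -> Set[int]:
    """Replay events in order; the last writer decides where irqs end up."""
    irq_cpus = set(ALL_CPUS)
    for event in events:
        if event == "tuned":
            irq_cpus = tuned_apply()
        elif event == "pod":
            irq_cpus = crio_irqbalance_apply(pod_cpus)
        else:
            raise ValueError(f"unknown event: {event}")
    return irq_cpus


if __name__ == "__main__":
    pod_cpus = {4, 5}  # cpus of a guaranteed pod with the "disable" annotation
    intended = ALL_CPUS - pod_cpus
    scenarios = {
        "pod launched after tuned started": ["tuned", "pod"],
        "tuned recovers after pod launched": ["pod", "tuned"],
        "tuned restarts at runtime": ["tuned", "pod", "tuned"],
    }
    for name, events in scenarios.items():
        result = final_irq_cpus(events, pod_cpus)
        status = "ok" if result == intended else "broken"
        print(f"{name}: irqs on {sorted(result)} -> {status}")
```

In this model, only the ordering in which the pod launch comes last yields the intended cpu set; any ordering that ends with tuned leaves the irqs on the reserved cpus.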
      Version-Release number of selected component (if applicable):

         4.14 and likely earlier

      How reproducible:

          See description

      Steps to Reproduce:

          1. See description
          

      Actual results:

          

      Expected results:

          

      Additional info:

          

       

              titzhak Talor Itzhak
              openshift-crt-jira-prow OpenShift Prow Bot
              SARGUN NARULA SARGUN NARULA
              Votes: 0
              Watchers: 10
