  1. OpenShift Bugs
  2. OCPBUGS-31442

Dynamic irq load balancing issues


Details

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.15
    • None
    • Important
    • No
    • CNF Compute Sprint 252, CNF Compute Sprint 253
    • 2
    • False
    • *Cause*: The script "clear-irqbalance-banned-cpus.sh" (launched by a systemd service on the node) sets an empty value for IRQBALANCE_BANNED_CPUS in /etc/sysconfig/irqbalance. When no value is provided, irqbalance calculates a default that excludes all isolated and nohz_full CPUs from balancing, so IRQs are balanced over the reserved CPUs only, breaking the user intent.
      If a guaranteed pod with the irq-load-balancing.crio.io: "disable" annotation is launched, irqbalance heals the system; if no such pod is ever launched, all IRQs stay affined to the reserved cores.

      *Consequence*: If GloballyDisableIrqLoadBalancing is set to "true" in the performance profile, IRQs should be balanced across all CPUs except those explicitly removed by CRI-O via the pod annotation irq-load-balancing.crio.io: "disable".
      On reboot, or when the tuned pods restart, this behavior can break, leaving IRQs balanced over the reserved CPUs only.

      *Fix*: The script needs to set the banned mask to all zeros on startup.

      *Result*: Restores the user intent of having IRQs balanced across all CPUs except those explicitly removed by CRI-O via the pod annotation irq-load-balancing.crio.io: "disable", when globallyDisableIrqLoadBalancing=true is set in the performance profile.
    • Bug Fix
    • In Progress

    Description

      This is a clone of issue OCPBUGS-31357. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-30980. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-27227. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-25699. The following is the description of the original issue:

      Description of problem:

      If GloballyDisableIrqLoadBalancing is disabled in the performance profile, IRQs should be balanced across all CPUs except those explicitly removed by CRI-O via the pod annotation irq-load-balancing.crio.io: "disable".
      
      We have found a number of issues with this: 
      
      1) The script clear-irqbalance-banned-cpus.sh sets an empty value for IRQBALANCE_BANNED_CPUS in /etc/sysconfig/irqbalance. When no value is provided, irqbalance calculates a default that excludes all isolated and nohz_full CPUs from balancing, so IRQs are balanced over the reserved CPUs only, breaking the user intent.
      If a guaranteed pod with the irq-load-balancing.crio.io: "disable" annotation is launched, irqbalance heals the system; if no such pod is ever launched, all IRQs stay affined to the reserved cores.
      This script needs to set the banned mask to all zeros on startup.
      
      2) The more serious issue: the scheduler plugin in tuned attempts to affine all IRQs to the non-isolated cores. "Isolated" here means non-reserved, not truly isolated cores. This is directly at odds with the user intent, so tuned ends up fighting with CRI-O/irqbalance, each trying to do something different.
      
      Scenarios
      - If a pod is launched with the annotation after tuned has started, either at runtime or after a reboot - ok
      - On a reboot, if tuned recovers after the guaranteed pod has been launched - broken
      - If tuned restarts at runtime for any reason - broken
      
      3) Lastly, the CRI-O restore of the irqbalance mask needs to be removed. Disabling it should be part of the CRI-O configuration that is installed by the NTO.
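
Item 1 hinges on the difference between an empty IRQBALANCE_BANNED_CPUS and an explicit all-zero mask. The Python sketch below models that difference as it is described in item 1; it is an illustration only, not irqbalance or NTO code. The function names are hypothetical, and the mask layout (hex, optionally split into comma-separated, zero-padded 32-bit groups) is an assumption about the file format.

```python
def parse_mask(mask):
    """Turn a hex CPU mask into a set of CPU ids.

    Assumption: groups separated by commas are zero-padded to 8 hex digits,
    most significant group first, so joining them yields one hex number.
    """
    value = int(mask.replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


def effective_banned(setting, isolated, nohz_full):
    """Model which CPUs irqbalance avoids, per the behavior in item 1.

    Empty or missing setting: irqbalance computes its own default, which
    bans the isolated and nohz_full CPUs.
    Explicit mask (including all zeros): the mask is used as given.
    """
    if setting is None or setting.strip() == "":
        return set(isolated) | set(nohz_full)
    return parse_mask(setting)


def zero_mask(num_cpus):
    """Build an all-zero mask wide enough for num_cpus (one group per 32 CPUs)."""
    groups = max(1, (num_cpus + 31) // 32)
    return ",".join(["00000000"] * groups)


# Hypothetical 4-CPU node: CPUs 0-1 reserved, CPUs 2-3 isolated and nohz_full.
isolated = {2, 3}
nohz_full = {2, 3}

# Empty value (the bug): IRQs end up confined to the reserved CPUs 0-1.
banned_when_empty = effective_banned("", isolated, nohz_full)

# All-zero value (the proposed fix): no CPU is banned until CRI-O bans some.
banned_when_zero = effective_banned(zero_mask(4), isolated, nohz_full)
```

With the empty value, the model bans CPUs 2 and 3; with the all-zero mask it bans none, which matches the intent of balancing IRQs across all CPUs until a pod annotation removes some.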

      Version-Release number of selected component (if applicable):

         4.14 and likely earlier

      How reproducible:

          See description

      Steps to Reproduce:

          1. See description
          2.
          3.
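
To check whether a node shows the symptom from item 1 of the description, one option is to compare each IRQ's affinity with the reserved CPU set. This is a minimal Python sketch, assuming the standard Linux /proc/irq/<n>/smp_affinity_list files; the function names are hypothetical and the reserved set must be taken from the node's own performance profile.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel-style CPU list such as "0-1,4" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_irq_affinities(proc_irq="/proc/irq"):
    """Map IRQ number -> set of CPUs it may run on; empty dict if unavailable."""
    root = Path(proc_irq)
    if not root.is_dir():
        return {}
    affinities = {}
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            text = (entry / "smp_affinity_list").read_text()
        except OSError:
            continue  # IRQ disappeared or is not readable
        affinities[int(entry.name)] = parse_cpu_list(text)
    return affinities


def irqs_confined_to(reserved, affinities):
    """Return the sorted IRQs whose affinity lies entirely within `reserved`."""
    return sorted(
        irq for irq, cpus in affinities.items() if cpus and cpus <= reserved
    )
```

On a node, `irqs_confined_to(parse_cpu_list("0-1"), read_irq_affinities())` would list the IRQs pinned to a hypothetical reserved set of CPUs 0-1. Some IRQs may have fixed affinity regardless of irqbalance, so a few confined IRQs do not by themselves indicate the bug; nearly all of them being confined is the pattern described above.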
          

      Actual results:

          

      Expected results:

          

      Additional info:

          

       

      Attachments

        Activity

          People

            yquinn@redhat.com Yanir Quinn
            openshift-crt-jira-prow OpenShift Prow Bot
            SARGUN NARULA SARGUN NARULA