Error 'Cannot change IRQ xxxx affinity: No space left on device' frequently logged


    • Bug
    • Resolution: Won't Do
    • Normal
    • rhel-9.4.z
    • irqbalance
    • Important
    • ZStream
    • rhel-kernel-debug
    • ssg_core_kernel
    • Approved Blocker

      Environment:
      OCP 4.16
      OCP 4.18

      Description of problem:
      Although there is enough CPU capacity, the message 'Cannot change IRQ xxxx affinity: No space left on device' is logged frequently.

      The issue was resolved after force-upgrading to irqbalance-1.9.4-2.el9.

      Actual result:
      A large number of 'No space left on device' messages are logged.
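
To gauge how many IRQs are affected on a node, the log can be scanned for the message quoted above. The following is a minimal Python sketch, not part of irqbalance. It assumes the message text matches the form given in this report; the function name `count_enospc_irqs` is hypothetical.

```python
import re
from collections import Counter

# Pattern taken from the message quoted in this report. The log prefix
# (timestamp, host, unit name) varies by system and is deliberately not matched.
ENOSPC_RE = re.compile(r"Cannot change IRQ (\d+) affinity: No space left on device")


def count_enospc_irqs(lines):
    """Return a Counter mapping IRQ number -> number of ENOSPC messages seen."""
    counts = Counter()
    for line in lines:
        match = ENOSPC_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Any iterable of log lines can be passed in, for example `sys.stdin` when the irqbalance journal output is piped into a small wrapper script.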

      Expected result:
      The irqbalance version in the ostree is upgraded so that the issue no longer occurs.

      Additional info:

      Details of the commit included in irqbalance-1.9.4-2.el9:

      commit 54051449030cb3c1642f9a6110316d3705eb3a23
      Author: Andrew Zaborowski <andrew.zaborowski@intel.com>
      Date: Fri May 10 18:57:34 2024 -0700

      Track IRQ "slots" count per CPU to avoid overflowing

      There are situations where irqbalance may try to migrate large numbers of
      IRQs to a topo_obj, there's no upper bound on the number as the
      placement logic is based on load mainly. The kernel's irq bitmasks limit
      the number of IRQs on each cpu and if more are tried to be migrated, the
      write to smp_affinity returns -ENOSPC. This confuses irqbalance's
      logic, the topo_obj.interrupts list no longer matches the irqs actually
      on that CPU or cache domain, and results in floods of error messages.
      See https://github.com/Irqbalance/irqbalance/issues/303 for details.

      For an easy fix, track the number of IRQ slots still free on each CPU.
      We start with INT_MAX meaning "unknown" and when we first get a -ENOSPC,
      we know we have no slots left. From there update the slots count each
      time we migrate IRQs to/from the CPU core topo_obj. We may never see an
      -ENOSPC and in that case there's no change in current logic, we never
      start tracking.

      This way we don't need to know ahead of time how many slots the kernel
      has for each CPU. The number may be arch specific (it is about 200 on
      x86-64) and is dependent on the number of managed IRQs the kernel has
      registered, so we don't want to guess. This is also more tolerant to
      the topo_obj.interrupts lists not matching exactly the kernel's idea of
      each irq's current affinity, e.g. due to -EIO errors in the smp_affinity
      writes.

      For now only do the tracking at OBJ_TYPE_CPU level so we don't have to
      update slots_left for all parent objs.

      This commit doesn't try to stop an ongoing activation of all the IRQs
      already scheduled for moving to one cpu, when that cpu starts returning
      ENOSPC. We'll still see a bunch of those errors in that iteration.
      But in subsequent calculate_placement() iterations we avoid assigning
      more IRQs to that cpu than we were able to successfully move before.
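
The slot-tracking idea described in the commit message can be illustrated with a small simulation. This is a hedged Python sketch, not the irqbalance implementation (which is written in C). The `Cpu` class, `migrate`, and `pick_cpu` are hypothetical names, the kernel's per-CPU limit is simulated by a `capacity` field, and placement uses IRQ count as a stand-in for irqbalance's load-based logic.

```python
import errno

INT_MAX = 2**31 - 1  # sentinel meaning "slot count unknown", as in the commit text


class Cpu:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # simulated kernel limit; the balancer never reads it
        self.irqs = set()          # IRQs the simulated kernel actually accepted
        self.slots_left = INT_MAX  # the balancer's view; unknown until first ENOSPC

    def write_affinity(self, irq):
        """Simulate the smp_affinity write, failing with ENOSPC when full."""
        if len(self.irqs) >= self.capacity:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.irqs.add(irq)


def migrate(irq, src, dst):
    """Try to move irq from src (may be None) to dst; return True on success."""
    try:
        dst.write_affinity(irq)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            dst.slots_left = 0  # first ENOSPC: dst is now known to be full
        return False
    if src is not None:
        src.irqs.discard(irq)
        if src.slots_left != INT_MAX:
            src.slots_left += 1  # a slot was freed on a tracked CPU
    if dst.slots_left != INT_MAX:
        dst.slots_left -= 1      # a slot was consumed on a tracked CPU
    return True


def pick_cpu(cpus):
    """Pick the CPU with the fewest IRQs among those not known to be full."""
    candidates = [c for c in cpus if c.slots_left > 0]
    return min(candidates, key=lambda c: len(c.irqs)) if candidates else None
```

A CPU that has never returned ENOSPC keeps `slots_left == INT_MAX` and is treated exactly as before, which mirrors the commit's statement that tracking only starts after the first failure.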

              ltao@redhat.com Liu Tao
              rhn-support-yhuang Ying Huang
              Liu Tao Liu Tao
              Michael Nguyen Michael Nguyen