[OCPBUGS-51057] [4.18] Running 10 pods with zero-packet-loss annotation crashes irqbalance on a vanilla OCP node

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.19.0
Affects Version/s: 4.16.z
Component/s: Node Tuning Operator
Labels:
None

Regression:
None
Blocked:
False
Blocked Reason:
None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
In Progress
Latest Status Summary:
2025/01/07 - YELLOW - A chicken-and-egg problem in the upstream merge process. See https://github.com/cri-o/cri-o/pull/8834#issuecomment-2550625521
RH Private Keywords:
Target Version:
4.18.z
Target Backport Versions:
4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

On a vanilla OCP installation, running a deployment of 10 pods annotated with irq-load-balancing.crio.io: "disable" crashes irqbalance on the worker node.
irqbalance is restarted several times in quick succession and then fails with "Start request repeated too quickly. irqbalance.service: Failed with result 'start-limit-hit'".
A single annotated pod works fine, but 10 pods kill irqbalance on the worker node.

Version-Release number of selected component (if applicable):

    4.16.z

How reproducible:

Steps to Reproduce:

1. Confirm that the irqbalance systemd service is running on the worker node, as in a default, unmodified installation:

[root@worker-3 ~]# systemctl status irqbalance
● irqbalance.service - irqbalance daemon
     Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-11-26 16:23:56 UTC; 1min 29s ago

2. Apply the performance profile as recommended:

---
kind: PerformanceProfile
apiVersion: "performance.openshift.io/v2"
metadata:
  name: blueprint-profile
spec:
  cpu:
    isolated: "1-19,21-39,41-59,61-79"
    reserved: "0,40,20,60"
  additionalKernelArgs:
    - nohz_full=1-19,21-39,41-59,61-79
  hugepages:
    pages:
      - size: "1G"
        count: 32
        node: 0
      - size: "1G"
        count: 32
        node: 1
      - size: "2M"
        count: 12000
        node: 0
      - size: "2M"
        count: 12000
        node: 1
  realTimeKernel:
    enabled: false
  workloadHints:
    realTime: false
    highPowerConsumption: false
    perPodPowerManagement: true
  net:
    userLevelNetworking: false
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
...
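
As a quick sanity check on the CPU partitioning in the profile above, the following minimal sketch expands the `isolated` and `reserved` CPU lists and confirms that they do not overlap. The helper name `parse_cpu_list` is hypothetical and is not part of any OpenShift tooling; only the two list strings are taken from the profile.

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as "0,40,20,60" or "1-19,21-39" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "1-19" covers CPUs 1 through 19.
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values copied from spec.cpu in the PerformanceProfile above.
isolated = parse_cpu_list("1-19,21-39,41-59,61-79")
reserved = parse_cpu_list("0,40,20,60")
overlap = isolated & reserved

print(len(isolated), len(reserved), sorted(overlap))  # prints: 76 4 []
```

For this profile the two sets are disjoint: 76 isolated CPUs and 4 reserved CPUs, which together cover CPU ids 0 through 79.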


Applying the profile generates a child Tuned object:

[kni@provisioner.cluster5.dfwt5g.lab ~]$ oc describe tuned/openshift-node-performance-blueprint-profile -n openshift-cluster-node-tuning-operator
-- snip --
[irqbalance]
# Disable the plugin entirely, which was enabled by the parent profile `cpu-partitioning`.
# It can be racy if TuneD restarts for whatever reason.
#> cpu-partitioning
enabled=false
-- snip --   
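
To check the generated profile programmatically, the `[irqbalance]` section shown above can be read with Python's standard `configparser`. This is only a sketch: it assumes the INI-style section has already been extracted from the Tuned object, and the function name `irqbalance_plugin_enabled` is hypothetical. Full TuneD profiles may use syntax that `configparser` does not accept.

```python
import configparser

# The [irqbalance] section exactly as quoted in the `oc describe` output above.
TUNED_SNIPPET = """\
[irqbalance]
# Disable the plugin entirely, which was enabled by the parent profile `cpu-partitioning`.
# It can be racy if TuneD restarts for whatever reason.
#> cpu-partitioning
enabled=false
"""


def irqbalance_plugin_enabled(profile_text: str) -> bool:
    """Return the value of `enabled` in the [irqbalance] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(profile_text)
    # Assumption: if the section or key is missing, treat the plugin as enabled.
    return parser.getboolean("irqbalance", "enabled", fallback=True)


print(irqbalance_plugin_enabled(TUNED_SNIPPET))  # prints: False
```

Lines starting with `#`, including the `#> cpu-partitioning` marker, are treated as comments by `configparser`, so only `enabled=false` is parsed.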


3. Run a deployment of 10 pods that carry the irq-load-balancing.crio.io: "disable" annotation:

$ cat ten_pods.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tania-test-deployment
  labels:
    app: tania-test
spec:
  replicas: 10
  selector:
    matchLabels:
      app: tania-test
  template:
    metadata:
      name: tania-test-pod
      annotations:
        irq-load-balancing.crio.io: "disable"
      labels:
        app: tania-test
    spec:
      nodeName: worker-3
      runtimeClassName: performance-blueprint-profile
      containers:
      - name: tania-test-pod
        image: registry.dfwt5g.lab:4443/chart/nginx-118
        command: ["sleep", "INF"]
        resources:
          limits:
            hugepages-1Gi: 2Gi
            cpu: "8"
            memory: 1000Mi  

4. irqbalance dies on the worker node:

Nov 26 16:32:50 worker-3 systemd[1]: Stopping irqbalance daemon...
Nov 26 16:32:50 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
Nov 26 16:32:50 worker-3 systemd[1]: Stopped irqbalance daemon.
Nov 26 16:32:50 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:50 worker-3 /usr/sbin/irqbalance[55815]: IRQBALANCE_BANNED_CPUS is discarded, Use IRQBALANCE_BANNED_CPULIST instead
Nov 26 16:32:50 worker-3 systemd[1]: Stopping irqbalance daemon...
Nov 26 16:32:50 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
Nov 26 16:32:50 worker-3 systemd[1]: Stopped irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 /usr/sbin/irqbalance[56025]: IRQBALANCE_BANNED_CPUS is discarded, Use IRQBALANCE_BANNED_CPULIST instead
Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Start request repeated too quickly.
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Failed with result 'start-limit-hit'.
Nov 26 16:32:51 worker-3 systemd[1]: Failed to start irqbalance daemon.
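
The journal above is consistent with systemd's start rate limiting: the service is started five times within about one second, and the next start request is refused. The sketch below counts the starts in an excerpt of that log. It assumes the systemd defaults of StartLimitBurst=5 and StartLimitIntervalSec=10s; the unit on the node may override them, and this was not checked here. The year is also an assumption, taken from the `systemctl status` output in step 1, because journal lines omit it. All function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Excerpt of the journal quoted above: only the start and failure messages.
JOURNAL = """\
Nov 26 16:32:50 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Start request repeated too quickly.
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Failed with result 'start-limit-hit'.
"""

ASSUMED_YEAR = 2024                       # assumption: journal lines carry no year
ASSUMED_BURST = 5                         # assumption: systemd default StartLimitBurst
ASSUMED_INTERVAL = timedelta(seconds=10)  # assumption: default StartLimitIntervalSec


def start_times(journal: str) -> list[datetime]:
    """Collect the timestamps of every 'Started irqbalance daemon.' line."""
    times = []
    for line in journal.splitlines():
        if line.endswith("Started irqbalance daemon."):
            stamp = " ".join(line.split()[:3])  # e.g. "Nov 26 16:32:50"
            times.append(
                datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")
            )
    return times


def max_starts_within(times: list[datetime], interval: timedelta) -> int:
    """Largest number of starts that fall inside any window of `interval`."""
    times = sorted(times)
    best = 0
    left = 0
    for right, current in enumerate(times):
        # Shrink the window from the left until it spans less than `interval`.
        while current - times[left] >= interval:
            left += 1
        best = max(best, right - left + 1)
    return best


starts = start_times(JOURNAL)
peak = max_starts_within(starts, ASSUMED_INTERVAL)
limit_hit = "start-limit-hit" in JOURNAL
print(f"{len(starts)} starts, peak {peak} within 10s, limit hit: {limit_hit}")
# prints: 5 starts, peak 5 within 10s, limit hit: True
```

Under these assumptions the five starts exhaust the burst allowance, so the sixth start request fails with 'start-limit-hit' and the service stays down until it is started again manually.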

Actual results:

    irqbalance dies on the worker

Expected results:

    irqbalance should not die

Additional info:

Clones:
OCPBUGS-45112 Running 10 pods with zero-packet-loss annotation crashes irqbalance on a vanilla OCP node (Verified)

Links to:
RHBA-2025:1904 OpenShift Container Platform 4.18.z bug fix update

Assignee: Team NTO
Reporter: Tatiana Krishtop
QA Contact: Mallapadi Niranjan
Votes: 0
Watchers: 4
Created: 2025/02/19 4:47 PM
Updated: 2025/03/04 5:11 PM
Resolved: 2025/03/04 5:11 PM
