Bug
Resolution: Done-Errata
Major
4.16.z
None
Description of problem:
On a vanilla OCP installation, running a deployment of 10 pods that carry the irq-load-balancing.crio.io: "disable" annotation kills irqbalance on the worker node. irqbalance is restarted several times in quick succession and then fails with "Start request repeated too quickly. irqbalance.service: Failed with result 'start-limit-hit'". A single pod with the annotation is fine; 10 pods take irqbalance down.
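A minimal sketch of why one pod is harmless while ten are not, under two assumptions that the report does not confirm: each annotated pod start triggers one restart of irqbalance.service, and the unit uses systemd's stock start rate limit (StartLimitBurst=5 within StartLimitIntervalSec=10s). The function name and the 0.1 s spacing between pod starts are illustrative only.

```python
def first_rejected_start(start_times, burst=5, interval=10.0):
    """Model a fixed-window start rate limit.

    start_times: ascending timestamps (seconds) of start requests.
    Returns the index of the first rejected request, or None if all pass.
    burst/interval default to systemd's stock values (an assumption about
    the irqbalance unit on the node, not something the report shows).
    """
    window_begin = None
    count = 0
    for i, t in enumerate(start_times):
        if window_begin is None or t > window_begin + interval:
            # Window expired (or first request): open a new window.
            window_begin, count = t, 1
        elif count < burst:
            count += 1
        else:
            # burst starts already used inside this window.
            return i
    return None


# Hypothetical timings: one restart per annotated pod, 0.1 s apart.
one_pod = [0.0]
ten_pods = [0.1 * n for n in range(10)]

print(first_rejected_start(one_pod))   # None: a single restart is fine
print(first_rejected_start(ten_pods))  # 5: the sixth start request is refused
```

Under these assumptions the sixth restart inside the window is refused, which would leave the service stopped until it is started again manually or the limit is reset.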
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
Steps to Reproduce:
1. irqbalance is running under systemd on the worker node, as in the default, unmodified installation:

    [root@worker-3 ~]# systemctl status irqbalance
    ● irqbalance.service - irqbalance daemon
         Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
         Active: active (running) since Tue 2024-11-26 16:23:56 UTC; 1min 29s ago

2. Apply the performance profile as recommended:

    ---
    kind: PerformanceProfile
    apiVersion: "performance.openshift.io/v2"
    metadata:
      name: blueprint-profile
    spec:
      cpu:
        isolated: "1-19,21-39,41-59,61-79"
        reserved: "0,40,20,60"
      additionalKernelArgs:
        - nohz_full=1-19,21-39,41-59,61-79
      hugepages:
        pages:
          - size: "1G"
            count: 32
            node: 0
          - size: "1G"
            count: 32
            node: 1
          - size: "2M"
            count: 12000
            node: 0
          - size: "2M"
            count: 12000
            node: 1
      realTimeKernel:
        enabled: false
      workloadHints:
        realTime: false
        highPowerConsumption: false
        perPodPowerManagement: true
      net:
        userLevelNetworking: false
      numa:
        topologyPolicy: "single-numa-node"
      nodeSelector:
        node-role.kubernetes.io/worker: ""
    ...

   It generates a child Tuned object:

    [kni@provisioner.cluster5.dfwt5g.lab ~]$ oc describe tuned/openshift-node-performance-blueprint-profile -n openshift-cluster-node-tuning-operator
    -- snip --
    [irqbalance]
    # Disable the plugin entirely, which was enabled by the parent profile `cpu-partitioning`.
    # It can be racy if TuneD restarts for whatever reason.
    #> cpu-partitioning
    enabled=false
    -- snip --

3. Run a deployment of 10 pods with the irq-load-balancing.crio.io: "disable" annotation:

    $ cat ten_pods.yml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tania-test-deployment
      labels:
        app: tania-test
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: tania-test
      template:
        metadata:
          name: tania-test-pod
          annotations:
            irq-load-balancing.crio.io: "disable"
          labels:
            app: tania-test
        spec:
          nodeName: worker-3
          runtimeClassName: performance-blueprint-profile
          containers:
            - name: tania-test-pod
              image: registry.dfwt5g.lab:4443/chart/nginx-118
              command: ["sleep", "INF"]
              resources:
                limits:
                  hugepages-1Gi: 2Gi
                  cpu: "8"
                  memory: 1000Mi

4. irqbalance dies on the worker:

    Nov 26 16:32:50 worker-3 systemd[1]: Stopping irqbalance daemon...
    Nov 26 16:32:50 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
    Nov 26 16:32:50 worker-3 systemd[1]: Stopped irqbalance daemon.
    Nov 26 16:32:50 worker-3 systemd[1]: Started irqbalance daemon.
    Nov 26 16:32:50 worker-3 /usr/sbin/irqbalance[55815]: IRQBALANCE_BANNED_CPUS is discarded, Use IRQBALANCE_BANNED_CPULIST instead
    Nov 26 16:32:50 worker-3 systemd[1]: Stopping irqbalance daemon...
    Nov 26 16:32:50 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
    Nov 26 16:32:50 worker-3 systemd[1]: Stopped irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
    Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
    Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
    Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
    Nov 26 16:32:51 worker-3 /usr/sbin/irqbalance[56025]: IRQBALANCE_BANNED_CPUS is discarded, Use IRQBALANCE_BANNED_CPULIST instead
    Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
    Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
    Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
    Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Start request repeated too quickly.
    Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Failed with result 'start-limit-hit'.
    Nov 26 16:32:51 worker-3 systemd[1]: Failed to start irqbalance daemon.
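The journal excerpt in step 4 can be summarized mechanically. The sketch below is an illustrative helper (the name summarize and the regular expression are not from any OpenShift or systemd tooling): it counts "Started irqbalance daemon." entries, measures the time between the first and last of them, and checks for start-limit-hit. The sample input contains only the relevant lines copied from the excerpt, and the year 2024 is taken from the systemctl status output in step 1.

```python
import re
from datetime import datetime

# Matches syslog-style lines emitted by systemd (PID 1) only.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: (.*)$")


def summarize(journal_text, year=2024):
    """Return (start_count, seconds_between_first_and_last_start, limit_hit)."""
    starts = []
    limit_hit = False
    for line in journal_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. lines logged by irqbalance itself
        stamp, message = match.groups()
        if message == "Started irqbalance daemon.":
            # Journal short format has no year, so one is supplied.
            starts.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
        elif "start-limit-hit" in message:
            limit_hit = True
    span = (starts[-1] - starts[0]).total_seconds() if starts else 0.0
    return len(starts), span, limit_hit


SAMPLE = """\
Nov 26 16:32:50 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Start request repeated too quickly.
Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Failed with result 'start-limit-hit'.
"""

print(summarize(SAMPLE))  # (5, 1.0, True)
```

Five successful starts within about one second, followed by a refused start, is consistent with systemd's default limit of five starts per ten seconds; whether irqbalance.service on this node overrides that default is not shown in the report.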
Actual results:
irqbalance dies on the worker
Expected results:
irqbalance should not die
Additional info:
- clones: OCPBUGS-45112 Running 10 pods with zero-packet-loss annotation crashes irqbalance on a vanilla OCP node, Verified
- links to: RHBA-2025:1904 OpenShift Container Platform 4.18.z bug fix update