OpenShift Bugs / OCPBUGS-51057

[4.18] Running 10 pods with zero-packet-loss annotation crashes irqbalance on a vanilla OCP node

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • 4.19.0
    • 4.16.z
    • Node Tuning Operator
    • None

      Description of problem:

      On a vanilla OCP installation, running a deployment of 10 pods annotated with irq-load-balancing.crio.io: "disable" crashes irqbalance.
      The bug: irqbalance is restarted several times in quick succession and then fails, with systemd reporting "Start request repeated too quickly" and "irqbalance.service: Failed with result 'start-limit-hit'".
      A single pod is fine, but 10 pods kill irqbalance on the worker node.

      Version-Release number of selected component (if applicable):

          4.16.z

      How reproducible:

          

      Steps to Reproduce:

      1. Confirm that the irqbalance systemd service is running on the worker node, as in a default, unmodified installation:
      
      [root@worker-3 ~]# systemctl status irqbalance
      ● irqbalance.service - irqbalance daemon
           Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; enabled; preset: enabled)
           Active: active (running) since Tue 2024-11-26 16:23:56 UTC; 1min 29s ago
      
      2. Apply the performance profile as recommended:
      
      ---
      kind: PerformanceProfile
      apiVersion: "performance.openshift.io/v2"
      metadata:
        name: blueprint-profile
      spec:
        cpu:
          isolated: "1-19,21-39,41-59,61-79"
          reserved: "0,40,20,60"
        additionalKernelArgs:
          - nohz_full=1-19,21-39,41-59,61-79
        hugepages:
          pages:
            - size: "1G"
              count: 32
              node: 0
            - size: "1G"
              count: 32
              node: 1
            - size: "2M"
              count: 12000
              node: 0
            - size: "2M"
              count: 12000
              node: 1
        realTimeKernel:
          enabled: false
        workloadHints:
          realTime: false
          highPowerConsumption: false
          perPodPowerManagement: true
        net:
          userLevelNetworking: false
        numa:
          topologyPolicy: "single-numa-node"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ...
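The CPU partitioning in this profile can be sanity-checked with a short script. The following is a minimal sketch, not part of the original report: the helper name `parse_cpu_list` is hypothetical, and the check assumes the node exposes exactly CPUs 0-79, which the profile suggests but does not state.

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a kernel-style CPU list such as "0,40,20,60" or "1-19,21-39"."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values copied from spec.cpu in the PerformanceProfile above.
isolated = parse_cpu_list("1-19,21-39,41-59,61-79")
reserved = parse_cpu_list("0,40,20,60")

overlap = isolated & reserved
covered = isolated | reserved

print(f"isolated CPUs: {len(isolated)}")
print(f"reserved CPUs: {len(reserved)}")
print(f"overlap: {sorted(overlap)}")
# Assumption: the node has 80 logical CPUs numbered 0-79.
print(f"covers CPUs 0-79: {covered == set(range(80))}")
```

Under that assumption, the two sets are disjoint and together account for every CPU, so the profile leaves 76 isolated CPUs for workloads and 4 reserved CPUs for housekeeping.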
      
      
      This generates a child Tuned object:
      
      [kni@provisioner.cluster5.dfwt5g.lab ~]$ oc describe tuned/openshift-node-performance-blueprint-profile -n openshift-cluster-node-tuning-operator
      -- snip --
      [irqbalance]
      # Disable the plugin entirely, which was enabled by the parent profile `cpu-partitioning`.
      # It can be racy if TuneD restarts for whatever reason.
      #> cpu-partitioning
      enabled=false
      -- snip --   
      
      
      3. Run a deployment of 10 pods that carry the irq-load-balancing.crio.io: "disable" annotation:
      
      $ cat ten_pods.yml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: tania-test-deployment
        labels:
          app: tania-test
      spec:
        replicas: 10
        selector:
          matchLabels:
            app: tania-test
        template:
          metadata:
            name: tania-test-pod
            annotations:
              irq-load-balancing.crio.io: "disable"
            labels:
              app: tania-test
          spec:
            nodeName: worker-3
            runtimeClassName: performance-blueprint-profile
            containers:
            - name: tania-test-pod
              image: registry.dfwt5g.lab:4443/chart/nginx-118
              command: ["sleep", "INF"]
              resources:
                limits:
                  hugepages-1Gi: 2Gi
                  cpu: "8"
                  memory: 1000Mi  
      
      4. irqbalance dies on the worker node:
      
      Nov 26 16:32:50 worker-3 systemd[1]: Stopping irqbalance daemon...
      Nov 26 16:32:50 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
      Nov 26 16:32:50 worker-3 systemd[1]: Stopped irqbalance daemon.
      Nov 26 16:32:50 worker-3 systemd[1]: Started irqbalance daemon.
      Nov 26 16:32:50 worker-3 /usr/sbin/irqbalance[55815]: IRQBALANCE_BANNED_CPUS is discarded, Use IRQBALANCE_BANNED_CPULIST instead
      Nov 26 16:32:50 worker-3 systemd[1]: Stopping irqbalance daemon...
      Nov 26 16:32:50 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
      Nov 26 16:32:50 worker-3 systemd[1]: Stopped irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
      Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
      Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
      Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
      Nov 26 16:32:51 worker-3 /usr/sbin/irqbalance[56025]: IRQBALANCE_BANNED_CPUS is discarded, Use IRQBALANCE_BANNED_CPULIST instead
      Nov 26 16:32:51 worker-3 systemd[1]: Stopping irqbalance daemon...
      Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Deactivated successfully.
      Nov 26 16:32:51 worker-3 systemd[1]: Stopped irqbalance daemon.
      Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Start request repeated too quickly.
      Nov 26 16:32:51 worker-3 systemd[1]: irqbalance.service: Failed with result 'start-limit-hit'.
      Nov 26 16:32:51 worker-3 systemd[1]: Failed to start irqbalance daemon.
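The 'start-limit-hit' result in the journal above can be cross-checked against systemd's start rate limit. The following is a rough sketch, not part of the original report: it assumes irqbalance.service uses systemd's built-in defaults (StartLimitIntervalSec=10s, StartLimitBurst=5) rather than an override, and the helper names are hypothetical. The year 2024 is taken from the `systemctl status` output in step 1.

```python
from datetime import datetime, timedelta

# systemd built-in defaults; assumption: the unit on the node does not override them.
START_LIMIT_INTERVAL = timedelta(seconds=10)
START_LIMIT_BURST = 5

# The "Started" lines copied from the journal excerpt above.
JOURNAL = """\
Nov 26 16:32:50 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
Nov 26 16:32:51 worker-3 systemd[1]: Started irqbalance daemon.
"""


def start_times(journal: str, year: int = 2024) -> list[datetime]:
    """Return the timestamp of every successful irqbalance start in the journal text."""
    times = []
    for line in journal.splitlines():
        if line.endswith("Started irqbalance daemon."):
            stamp = " ".join(line.split()[:3])  # e.g. "Nov 26 16:32:50"
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def max_starts_in_window(times: list[datetime], window: timedelta) -> int:
    """Largest number of starts that fall inside any single window of the given length."""
    best = 0
    for i, first in enumerate(times):
        count = sum(1 for t in times[i:] if t - first < window)
        best = max(best, count)
    return best


starts = start_times(JOURNAL)
burst = max_starts_in_window(starts, START_LIMIT_INTERVAL)
limit_reached = burst >= START_LIMIT_BURST

print(f"starts observed: {len(starts)}")
print(f"max starts within {START_LIMIT_INTERVAL.seconds}s: {burst}")
print(f"start limit reached: {limit_reached}")
```

With those defaults, the five starts logged within about one second exhaust the burst allowance, which is consistent with systemd refusing the next start request and marking the unit as failed.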

      Actual results:

          irqbalance hits the systemd start limit and fails on the worker node.

      Expected results:

          irqbalance should keep running when multiple annotated pods start.

      Additional info:

          

              Team NTO (team-nto)
              Tatiana Krishtop (tkrishto@redhat.com)
              Mallapadi Niranjan