OpenShift Bugs / OCPBUGS-32251

Busy loop thread running on an isolated core being pre-empted by irq_work/CPUn

Details

    • Bug
    • Resolution: Not a Bug
    • Undefined
    • None
    • 4.14
    • None
    • No
    • False
    • 2024-04-17: Investigation is ongoing

    Description

      Description of problem:

      I have a testpmd pod running on an isolated core of a system with workload partitioning enabled; CPU3 is one of the isolated cores.
      The packet forwarding thread (rte-worker-3, thread ID 2570754) of the testpmd process (PID 2570750) runs on CPU3 as a busy loop with scheduling policy SCHED_FIFO and scheduling priority 1, so it should not be interrupted on the isolated CPU3.
      However, a function_graph trace on CPU3 shows that the testpmd forwarding thread has been interrupted multiple times by the task shown as irq_wor-46 (name truncated by the tracer, PID 46), which appears to be the irq_work/CPUn kernel thread for this CPU:
      
      3)   0.419 us    |          save_fpregs_to_fpstate();
      ------------------------------------------  
      3) rte-wor-2570754 =>   irq_wor-46   
      ------------------------------------------    
      3)               |          finish_task_switch.isra.0() {  
      3)               |            vtime_task_switch_generic() {
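
To quantify how often the forwarding thread is switched out, the context-switch markers in the function_graph output (the `prev-PID => next-PID` lines) can be counted. The following is a minimal sketch, not part of the original report: the function name `count_switches` and the embedded `SAMPLE` text are illustrative assumptions, and the regular expression assumes the marker layout shown in the excerpt above.

```python
import re
from collections import Counter

# Matches function_graph context-switch markers such as
#   " 3) rte-wor-2570754 =>   irq_wor-46"
# Groups: CPU, previous comm, previous PID, next comm, next PID.
SWITCH_RE = re.compile(r"^\s*(\d+)\)\s+(\S+)-(\d+)\s+=>\s+(\S+)-(\d+)\s*$")


def count_switches(trace_text, cpu=None):
    """Count context switches per (prev_comm, prev_pid, next_comm, next_pid).

    If cpu is given, only markers recorded on that CPU are counted.
    """
    counts = Counter()
    for line in trace_text.splitlines():
        m = SWITCH_RE.match(line)
        if not m:
            continue  # ordinary function entry/exit line or separator
        if cpu is not None and int(m.group(1)) != cpu:
            continue
        key = (m.group(2), int(m.group(3)), m.group(4), int(m.group(5)))
        counts[key] += 1
    return counts


# Hypothetical sample, copied from the trace excerpt in this report.
SAMPLE = """\
 3)   0.419 us    |          save_fpregs_to_fpstate();
 ------------------------------------------
 3) rte-wor-2570754 =>   irq_wor-46
 ------------------------------------------
 3)               |          finish_task_switch.isra.0() {
 3)               |            vtime_task_switch_generic() {
"""

for switch, n in count_switches(SAMPLE, cpu=3).items():
    print(switch, n)
```

Run against a full trace file, this gives one count per (previous task, next task) pair, which makes it easy to see whether irq_work is the only task pre-empting the busy loop.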
      
      
      Here are the scheduling stats for that thread:
      
      #### /proc/2570750/task/2570754/sched 
      rte-worker-3 (2570754, #threads: 8)
      -------------------------------------------------------------------
      se.exec_start                                :     551543727.283270
      se.vruntime                                  :             0.000000
      se.sum_exec_runtime                          :       7103349.966615
      se.nr_migrations                             :                    1
      nr_switches                                  :                   75
      nr_voluntary_switches                        :                    2
      nr_involuntary_switches                      :                   73
      se.load.weight                               :              1048576
      se.avg.load_sum                              :                47295
      se.avg.runnable_sum                          :             48430080
      se.avg.util_sum                              :             48430080
      se.avg.load_avg                              :                 1024
      se.avg.runnable_avg                          :                 1024
      se.avg.util_avg                              :                 1024
      se.avg.last_update_time                      :      544440981193728
      se.avg.util_est.ewma                         :                    1
      se.avg.util_est.enqueued                     :                    1
      policy                                       :                    1
      prio                                         :                   98
      clock-delta                                  :                   43
      #### 
      
      We can see that nr_involuntary_switches is 73, i.e. 73 of the thread's 75 context switches were involuntary.
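
The same check can be scripted so that the counters are easy to compare before and after a test run. The following is a rough sketch under stated assumptions: the helper names (`read_thread_sched`, `parse_sched`, `summarize`) are hypothetical, the `SAMPLE` text is an abbreviated copy of the output above, and the `99 - prio` conversion assumes the usual kernel mapping for real-time tasks that are not currently priority-boosted.

```python
POLICY_NAMES = {0: "SCHED_OTHER", 1: "SCHED_FIFO", 2: "SCHED_RR"}


def read_thread_sched(pid, tid):
    """Read the raw scheduler stats of one thread from procfs."""
    with open(f"/proc/{pid}/task/{tid}/sched") as f:
        return f.read()


def parse_sched(text):
    """Parse 'name : value' lines; non-numeric lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            continue  # e.g. the "comm (tid, #threads: N)" header line
    return stats


def summarize(stats):
    """Extract the fields relevant to pre-emption of a busy-loop thread."""
    policy = int(stats["policy"])
    prio = int(stats["prio"])
    summary = {
        "policy": POLICY_NAMES.get(policy, str(policy)),
        "switches": int(stats["nr_switches"]),
        "involuntary": int(stats["nr_involuntary_switches"]),
    }
    if policy in (1, 2):
        # Assumption: for real-time policies, prio = 99 - rt_priority.
        summary["rt_priority"] = 99 - prio
    return summary


# Abbreviated copy of the stats shown in this report.
SAMPLE = """\
rte-worker-3 (2570754, #threads: 8)
-------------------------------------------------------------------
se.sum_exec_runtime                          :       7103349.966615
nr_switches                                  :                   75
nr_voluntary_switches                        :                    2
nr_involuntary_switches                      :                   73
policy                                       :                    1
prio                                         :                   98
"""

print(summarize(parse_sched(SAMPLE)))
```

On a live system, `summarize(parse_sched(read_thread_sched(2570750, 2570754)))` would read the same fields directly; sampling it twice and comparing `involuntary` shows whether pre-emptions are still occurring.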
      
      The irq_work/CPUn thread appears to have been introduced by upstream patch [1] and by RHEL 9 patch [2].
      
      [1]: https://github.com/torvalds/linux/commit/b4c6f86ec2f648b5e6d4b04564fbc6d5351160a8
      [2]: https://gitlab.com/redhat/rhel/src/kernel/rhel-9/-/commit/62014d41db107099b22b77b5eb0011d5ba07df1b

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1. Deploy an SNO cluster with DU profile
          2. Run a testpmd pod
          

      Actual results:

          

      Expected results:

          

      Additional info:

          

          People

            msivak@redhat.com Martin Sivak
            dosman@redhat.com Dahir Osman
            Gowrishankar Rajaiyan Gowrishankar Rajaiyan