OpenShift Bugs / OCPBUGS-45520

kubelet high C2C contention


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: 4.14.z
    • Component: Node / Kubelet
    • Quality / Stability / Reliability
    • Severity: Moderate
    • Architecture: x86_64

      Description of problem:

          While profiling a bare metal worker node with perf, we noticed high CPU cacheline contention.

      Version-Release number of selected component (if applicable):

          4.14.31

      How reproducible:

          At the customer cluster during load testing

      Steps to Reproduce:

          1. Run the customer performance test.
          2. Run perf c2c to capture the data (see the sketch below).
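
      For reference, a minimal sketch of a typical perf c2c collection and report on the worker node; the exact options and duration used for the customer collection are not listed here, so the flags below are only illustrative:

      ~~~
      # Sketch only: system-wide cache-to-cache (c2c) sampling during the load test.
      # The 60-second window is an assumption; adjust it to cover the test run.
      perf c2c record -a -g -- sleep 60

      # Summarize the hottest contended cachelines, their load latencies, and
      # the processes/symbols touching them (e.g. kubelet vs. http-nio-8080 JIT code).
      perf c2c report --stdio
      ~~~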

      Actual results:

      There is a very hot cacheline related to kubelet that we cannot yet attribute to anything specific (it may also be caused by the customer's own setup or application).

      Expected results:

       Reduced hot cacheline contention.

      Additional info:

        The customer is running an application performance test on large bare metal nodes with 96 CPUs (48 x 2).
      During the investigation of the application (Spring Boot running Tomcat) we collected perf c2c data:
      https://issues.redhat.com/browse/MPL-673
      Analysis: https://issues.redhat.com/browse/MPL-673?focusedId=26190484&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26190484
      
      ~~~
      Three of those cachelines were really hot, with some instruction load latencies up in the 10s of milliseconds. Accesses were from all or nearly all cpus across both numa nodes. Of those three, two were from http-nio-8080 JIT /perf/tmp-<pid> code. The third was from kubelet, and that was by far the worst in terms of load latencies (with the average load latencies for instructions involved in that contention falling in the 30K-50K machine cycle range to complete [retire the instruction]. That's really hot.)
      ~~~
      
      We are trying to find out whether there is a way to identify what this cacheline is: is it related to some customer-specific setup or application, or is it a kubelet issue?
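
      As a starting point, a hedged sketch of how the existing perf.data from the c2c collection might be broken down to kubelet symbols (assuming the perf.data file and the node's kubelet binary/debuginfo are available; the flags below are standard perf options, not necessarily the ones used in the original collection):

      ~~~
      # Per-cacheline breakdown: for each hot cacheline, perf c2c lists the
      # offsets, symbols, and PIDs of the loads/stores hitting it.
      perf c2c report --stdio

      # Restrict a regular perf report to the kubelet DSO to see which
      # kubelet functions the sampled instructions belong to.
      perf report --stdio --dsos kubelet

      # Annotate a suspect kubelet function with per-instruction sample counts
      # (<kubelet_symbol_name> is a placeholder).
      perf annotate --stdio <kubelet_symbol_name>
      ~~~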
      
      The perf collection (analyzed by Joe Mario) is available in the above-mentioned Jira.
      
      I had the customer collect data under load again today (though I can't tell for certain that the hot cacheline contention was occurring; the load was similar to the previous perf collection).
      
      Collected a kubelet trace, CPU pprof, and heap pprof, hoping they might show or capture the issue.
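
      For reference, a minimal sketch of how those kubelet profiles can be pulled through the API server's node proxy (the node name and durations are placeholders, and this assumes the kubelet's debug/pprof handlers are enabled on the node):

      ~~~
      NODE=worker-0.example.com   # placeholder node name

      # CPU profile, heap profile, and execution trace from the kubelet's
      # debug/pprof endpoints, proxied through the API server.
      oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/profile?seconds=30" > kubelet-cpu.pprof
      oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/heap"               > kubelet-heap.pprof
      oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/trace?seconds=10"   > kubelet.trace

      # Inspect locally with the Go tooling.
      go tool pprof -top kubelet-cpu.pprof
      go tool trace kubelet.trace
      ~~~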

              Assignee: Ryan Phillips
              Reporter: Ilan Green
              QA Contact: Cameron Meadors