Type: Bug
Resolution: Unresolved
Priority: Minor
Affects Version: 4.14.z
Category: Quality / Stability / Reliability
Severity: Moderate
Architecture: x86_64
Description of problem:
While profiling a bare-metal worker node with perf, we noticed high CPU cache-line contention.
Version-Release number of selected component (if applicable):
4.14.31
How reproducible:
At the customer cluster during load testing
Steps to Reproduce:
1. Run the customer performance test.
2. Run perf to capture the data (see the sketch below).
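For reference, the cache-contention data was captured with perf c2c. A minimal sketch of that capture-and-report flow follows; the sampling duration and the report coalescing fields are assumptions, not necessarily the exact options used in the customer collection:
~~~
# Record cache-to-cache (HITM) activity system-wide for 60 seconds while the
# customer load test is running; -g keeps call chains for later attribution.
perf c2c record -a -g -- sleep 60

# Summarize the contended cachelines and the loads/stores hitting them.
# Coalescing on pid and instruction address groups the contending accesses.
perf c2c report --stdio -c pid,iaddr > perf-c2c-report.txt
~~~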
Actual results:
There is a very hot cache line related to kubelet that we cannot attribute to anything specific (it might also be related to the customer's own setup).
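Since kubelet is a Go binary that ships with symbols, one possible way to attribute the hot cache line is to drill into the existing perf.data by DSO. This is a hedged sketch; the exact option spellings should be checked against the perf version on the node:
~~~
# Limit the report to samples from the kubelet binary to see which Go
# functions own the contended loads/stores.
perf report --stdio --dsos kubelet > kubelet-hotspots.txt

# Annotate the hottest kubelet symbols at instruction level to match the
# offending instruction addresses shown in the c2c report.
perf annotate --stdio --dsos kubelet
~~~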
Expected results:
The hot cache-line contention is reduced, or at least its source is identified.
Additional info:
The customer is running an application performance test on large bare-metal nodes with 96 CPUs (2 x 48). While investigating the application (Spring Boot running Tomcat) we collected perf c2c data: https://issues.redhat.com/browse/MPL-673

Analysis: https://issues.redhat.com/browse/MPL-673?focusedId=26190484&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26190484
~~~
Three of those cachelines were really hot, with some instruction load latencies up in the 10s of milliseconds. Accesses were from all or nearly all cpus across both numa nodes. Of those three, two were from http-nio-8080 JIT /perf/tmp-<pid> code. The third was from kubelet, and that was by far the worst in terms of load latencies (with the averages for load latencies for instructions involved in that contention falling in the 30K-50K machine cycles to complete [retire the instruction]. That's really hot.)
~~~
We are trying to find out whether there is a way to determine what this cache line is: whether it is related to something in the customer's setup or application, or to a kubelet issue.

The perf collection (analyzed by Joe Mario) is available on the above-mentioned Jira.

Today I had the customer collect a kubelet trace, a CPU pprof, and a heap pprof during load, hoping they might show or capture the issue (though I cannot tell for sure that the hot cache line was occurring at the time; the load was similar to the previous perf collection).
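For reference, a minimal sketch of how kubelet profiling data like this can be pulled on an OpenShift node, assuming the default kubelet debug handlers are enabled and going through the API server's node proxy; the node name, durations, and output file names below are placeholders, not the customer's actual values:
~~~
# Placeholder worker node name.
NODE=worker-0.example.com

# 30s CPU profile, a heap snapshot, and a short execution trace from the kubelet.
oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/profile?seconds=30" > kubelet-cpu.pprof
oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/heap" > kubelet-heap.pprof
oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/trace?seconds=5" > kubelet.trace

# Inspect locally with the Go tooling.
go tool pprof kubelet-cpu.pprof
go tool trace kubelet.trace
~~~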