OpenShift Bugs / OCPBUGS-45520

kubelet high C2C contention


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: 4.14.z
    • Component: Node / Kubelet
    • Quality / Stability / Reliability
    • Severity: Moderate
    • Architecture: x86_64

      Description of problem:

          While profiling a bare metal worker node with perf, we noticed high CPU cacheline contention.

      Version-Release number of selected component (if applicable):

          4.14.31

      How reproducible:

          At the customer cluster during load testing

      Steps to Reproduce:

          1. Run the customer performance test.
          2. Run perf c2c to capture the data (see the sketch below).
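
      For reference, a minimal sketch of a typical perf c2c collection and report on the worker node; the exact options and duration used for the customer collection are not listed here, so the flags below are only illustrative:

      ~~~
      # Sketch only: system-wide cache-to-cache (c2c) sampling during the load test.
      # The 60-second window is an assumption; adjust it to cover the test run.
      perf c2c record -a -g -- sleep 60

      # Summarize the hottest contended cachelines, their load latencies, and
      # the processes/symbols touching them (e.g. kubelet vs. http-nio-8080 JIT code).
      perf c2c report --stdio
      ~~~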

      Actual results:

      There is a very hot cacheline related to kubelet that we cannot yet attribute to anything specific (it may also be caused by the customer's own setup or application).

      Expected results:

       Reduced hot cacheline contention.

      Additional info:

        The customer is running an application performance test on large bare metal nodes with 96 CPUs (48 x 2).
      During the investigation of the application (Spring Boot running Tomcat) we collected perf c2c data:
      https://issues.redhat.com/browse/MPL-673
      Analysis: https://issues.redhat.com/browse/MPL-673?focusedId=26190484&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26190484
      
      ~~~
      Three of those cachelines were really hot, with some instruction load latencies up in the 10s of milliseconds. Accesses were from all or nearly all cpus across both numa nodes. Of those three, two were from http-nio-8080 JIT /perf/tmp-<pid> code. The third was from kubelet, and that was by far the worst in terms of load latencies (with the average load latencies for instructions involved in that contention falling in the 30K-50K machine cycle range to complete [retire the instruction]. That's really hot.)
      ~~~
      
      We are trying to find out whether there is a way to identify what this cacheline is: is it related to some customer-specific setup or application, or is it a kubelet issue?
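
      As a starting point, a hedged sketch of how the existing perf.data from the c2c collection might be broken down to kubelet symbols (assuming the perf.data file and the node's kubelet binary/debuginfo are available; the flags below are standard perf options, not necessarily the ones used in the original collection):

      ~~~
      # Per-cacheline breakdown: for each hot cacheline, perf c2c lists the
      # offsets, symbols, and PIDs of the loads/stores hitting it.
      perf c2c report --stdio

      # Restrict a regular perf report to the kubelet DSO to see which
      # kubelet functions the sampled instructions belong to.
      perf report --stdio --dsos kubelet

      # Annotate a suspect kubelet function with per-instruction sample counts
      # (<kubelet_symbol_name> is a placeholder).
      perf annotate --stdio <kubelet_symbol_name>
      ~~~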
      
      The perf collection (analyzed by Joe Mario) is available in the above-mentioned Jira.
      
      I had the customer collect data under load again today (though I can't tell for certain that the hot cacheline contention was occurring; the load was similar to the previous perf collection).
      
      Collected a kubelet trace, CPU pprof, and heap pprof, hoping they might show or capture the issue.
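
      For reference, a minimal sketch of how those kubelet profiles can be pulled through the API server's node proxy (the node name and durations are placeholders, and this assumes the kubelet's debug/pprof handlers are enabled on the node):

      ~~~
      NODE=worker-0.example.com   # placeholder node name

      # CPU profile, heap profile, and execution trace from the kubelet's
      # debug/pprof endpoints, proxied through the API server.
      oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/profile?seconds=30" > kubelet-cpu.pprof
      oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/heap"               > kubelet-heap.pprof
      oc get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/trace?seconds=10"   > kubelet.trace

      # Inspect locally with the Go tooling.
      go tool pprof -top kubelet-cpu.pprof
      go tool trace kubelet.trace
      ~~~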

              Assignee: Ryan Phillips
              Reporter: Ilan Green
              QA Contact: Cameron Meadors