Type: Bug
Resolution: Unresolved
Priority: Major
Severity: Important
Affects Version/s: 4.18, 4.19
Impact: Quality / Stability / Reliability
Description of problem:
OpenShift clusters deployed on hosts with very high core counts (e.g., 384-512+ SMT threads) are experiencing severe, intermittent CPU utilization spikes across core Go-based applications and services (e.g., ovn-kube, kubelet). These spikes are attributable to the default behavior of the Go runtime's garbage collector (GC) when the GOMAXPROCS environment variable is not set explicitly: the runtime sizes itself to the number of cores it can see, which is a reasonable assumption on bare metal but not in a containerized environment such as OpenShift, where defaulting to the full host core count increases scheduling and GC overhead and creates the potential for unexpected failures. On current hardware the core count can exceed 400, and it will exceed 1000 in the near future.

The customer initially observed excessive resource consumption and inefficient GC behavior in the kubelet component. That known issue is currently mitigated by applying a Performance Profile that restricts the kubelet's CPU affinity (see RFE-7881). The core concern of this bug is that the same issue affects ovn-kube and other containerized OpenShift applications, which cannot be mitigated through the Performance Profile approach. We need a systemic solution that ensures core OpenShift components using the Go runtime are configured with an appropriate GOMAXPROCS value rather than defaulting to the physical host's core count.
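The kind of fix we have in mind is illustrated by the minimal sketch below: derive GOMAXPROCS from the container's CPU quota instead of the host core count. This is not a proposed implementation; it assumes cgroup v2 (the /sys/fs/cgroup/cpu.max path) and a simple round-down policy, and the existing go.uber.org/automaxprocs library applies the same idea with broader cgroup handling.

// Minimal sketch (assumptions noted above): cap GOMAXPROCS at the number of
// CPUs implied by the container's cgroup v2 quota, falling back to the
// runtime default when no quota is set.
package main

import (
        "fmt"
        "os"
        "runtime"
        "strconv"
        "strings"
)

// quotaProcs returns the CPU count implied by /sys/fs/cgroup/cpu.max,
// or 0 if no quota is set or the file cannot be read.
func quotaProcs() int {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
        if err != nil {
                return 0
        }
        fields := strings.Fields(string(data)) // e.g. "200000 100000" or "max 100000"
        if len(fields) != 2 || fields[0] == "max" {
                return 0
        }
        quota, err1 := strconv.ParseFloat(fields[0], 64)
        period, err2 := strconv.ParseFloat(fields[1], 64)
        if err1 != nil || err2 != nil || period == 0 {
                return 0
        }
        n := int(quota / period) // round down, but never below 1
        if n < 1 {
                n = 1
        }
        return n
}

func main() {
        if n := quotaProcs(); n > 0 && n < runtime.NumCPU() {
                runtime.GOMAXPROCS(n)
        }
        fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}

With a 2-core quota on a 384-core host, this would report GOMAXPROCS=2 instead of 384, which is the behavior we would like core components to exhibit by default.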
Version-Release number of selected component (if applicable):
OpenShift 4.18, 4.19, 4.20
How reproducible:
The issue is reproducible on nodes with more than 128 CPU cores; the severity of the CPU spikes is directly proportional to the node's core count.
Steps to Reproduce:
1. Deploy an OpenShift cluster on high-core-count hardware (128+ cores).
2. Check the OS thread count and effective GOMAXPROCS for Go-based applications (a small helper sketch follows the sample output below).
3. Observe CPU utilization directly via pidstat or some other means:
$ pidstat -p $PID 1 | awk 'NR>3 {print $1,$8}'
06:34:13 18.00
06:34:14 17.00
06:34:15 12.00
06:34:16 12.00
06:34:17 4416.00
06:34:18 14.00
06:34:19 14.00
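To make step 2 concrete, the following tiny helper (hypothetical, not an existing OpenShift tool) can be compiled and run inside an affected container to show what the Go runtime believes about the CPU topology. With GOMAXPROCS unset, the effective GOMAXPROCS matches the host core count, which is the condition this bug describes.

// Hypothetical diagnostic helper: prints the host CPUs visible to the
// process, the effective GOMAXPROCS, and the GOMAXPROCS env var (if any).
package main

import (
        "fmt"
        "os"
        "runtime"
)

func main() {
        fmt.Printf("host CPUs visible (NumCPU): %d\n", runtime.NumCPU())
        fmt.Printf("effective GOMAXPROCS:       %d\n", runtime.GOMAXPROCS(0))
        fmt.Printf("GOMAXPROCS env var:         %q\n", os.Getenv("GOMAXPROCS"))
}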
Actual results:
Significant CPU utilization spikes (thousands of percent of a single CPU in the pidstat output above), together with very high OS thread counts for Go-based applications.
Expected results:
CPU utilization should remain stable regardless of the number of cores enabled on the hardware; Go-based components should size themselves to their container CPU allocation rather than the host core count.