Type: Bug
Resolution: Unresolved
Priority: Major
Severity: Important
Affects Version/s: 4.18, 4.19
Impact: Quality / Stability / Reliability
Description of problem:
OpenShift clusters deployed on hosts with very high core counts (e.g., 384-512+ SMT threads) are experiencing severe, intermittent CPU utilization spikes across core Go-based applications and services (e.g., ovn-kube, kubelet). These spikes are attributable to the default behavior of the Go runtime's garbage collector (GC) when the GOMAXPROCS environment variable is not set explicitly: the runtime sizes itself to the number of cores it can see, which is a reasonable assumption on bare metal but not in a containerized environment such as OpenShift, where defaulting to the full host core count increases scheduling and GC overhead and creates the potential for unexpected failures. On current hardware the core count can exceed 400, and it will exceed 1000 in the near future.

The customer initially observed excessive resource consumption and inefficient GC behavior in the kubelet component. That known issue is currently mitigated by applying a Performance Profile that restricts the kubelet's CPU affinity (see RFE-7881). The core concern of this bug is that the same issue affects ovn-kube and other containerized OpenShift applications, which cannot be mitigated through the Performance Profile approach. We need a systemic solution that ensures core OpenShift components using the Go runtime are configured with an appropriate GOMAXPROCS value rather than defaulting to the physical host's core count.
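The kind of fix we have in mind is illustrated by the minimal sketch below: derive GOMAXPROCS from the container's CPU quota instead of the host core count. This is not a proposed implementation; it assumes cgroup v2 (the /sys/fs/cgroup/cpu.max path) and a simple round-down policy, and the existing go.uber.org/automaxprocs library applies the same idea with broader cgroup handling.

// Minimal sketch (assumptions noted above): cap GOMAXPROCS at the number of
// CPUs implied by the container's cgroup v2 quota, falling back to the
// runtime default when no quota is set.
package main

import (
        "fmt"
        "os"
        "runtime"
        "strconv"
        "strings"
)

// quotaProcs returns the CPU count implied by /sys/fs/cgroup/cpu.max,
// or 0 if no quota is set or the file cannot be read.
func quotaProcs() int {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
        if err != nil {
                return 0
        }
        fields := strings.Fields(string(data)) // e.g. "200000 100000" or "max 100000"
        if len(fields) != 2 || fields[0] == "max" {
                return 0
        }
        quota, err1 := strconv.ParseFloat(fields[0], 64)
        period, err2 := strconv.ParseFloat(fields[1], 64)
        if err1 != nil || err2 != nil || period == 0 {
                return 0
        }
        n := int(quota / period) // round down, but never below 1
        if n < 1 {
                n = 1
        }
        return n
}

func main() {
        if n := quotaProcs(); n > 0 && n < runtime.NumCPU() {
                runtime.GOMAXPROCS(n)
        }
        fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}

With a 2-core quota on a 384-core host, this would report GOMAXPROCS=2 instead of 384, which is the behavior we would like core components to exhibit by default.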
Version-Release number of selected component (if applicable):
OpenShift 4.18, 4.19, 4.20
How reproducible:
The issue is reproducible on nodes with more than 128 CPU cores; the severity of the CPU spikes is directly proportional to the node's core count.
Steps to Reproduce:
1. Deploy an OpenShift cluster on high-core-count hardware (128+ cores).
2. Check the OS thread count and effective GOMAXPROCS for Go-based applications (a small helper sketch follows the sample output below).
3. Observe CPU utilization directly via pidstat or some other means:
$ pidstat -p $PID 1 | awk 'NR>3 {print $1,$8}'
06:34:13 18.00
06:34:14 17.00
06:34:15 12.00
06:34:16 12.00
06:34:17 4416.00
06:34:18 14.00
06:34:19 14.00
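To make step 2 concrete, the following tiny helper (hypothetical, not an existing OpenShift tool) can be compiled and run inside an affected container to show what the Go runtime believes about the CPU topology. With GOMAXPROCS unset, the effective GOMAXPROCS matches the host core count, which is the condition this bug describes.

// Hypothetical diagnostic helper: prints the host CPUs visible to the
// process, the effective GOMAXPROCS, and the GOMAXPROCS env var (if any).
package main

import (
        "fmt"
        "os"
        "runtime"
)

func main() {
        fmt.Printf("host CPUs visible (NumCPU): %d\n", runtime.NumCPU())
        fmt.Printf("effective GOMAXPROCS:       %d\n", runtime.GOMAXPROCS(0))
        fmt.Printf("GOMAXPROCS env var:         %q\n", os.Getenv("GOMAXPROCS"))
}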
Actual results:
Significant CPU utilization spikes (thousands of percent of a single CPU in the pidstat output above), together with very high OS thread counts for Go-based applications.
Expected results:
CPU utilization should remain stable regardless of the number of cores enabled on the hardware; Go-based components should size themselves to their container CPU allocation rather than the host core count.