-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.14, 4.15, 4.16, 4.17, 4.18, 4.19
-
None
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
None
-
Rejected
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
Within the go runtime, the usleep ABI contains an error that affects the calculation of the sleep time by a factor of 1000. This error leads to a massive increase how often nanosleep function is called and a resulting high increase in the number of syscalls made. Since usleep is the basis of many core functions of go, such as sysmon, and is generally used very frequently, the CPU consumption of all go-based programs can be correspondingly high. In heavily over-committed environments, the error can even lead to instability of clusters. For kubelet in OCP, for example, this means an increase in calls to the nanosleep function by a factor of 5-7. As many other operators in OCP make use of go runtime, the effect should be visible in many operators. Observed behavior: [pid 14431] nanosleep({tv_sec=0, tv_nsec=3}, <unfinished ...> [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, <unfinished ...> [pid 14431] <... nanosleep resumed>NULL) = 0 [pid 14428] <... nanosleep resumed>NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=20}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=1280}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=2560}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=5120}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=10000}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=10000}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=10000}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=10000}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=10000}, NULL) = 0 [pid 14428] nanosleep({tv_sec=0, tv_nsec=10000}, NULL) = 0 Correct behavior: [pid 49563] nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=40000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=80000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=160000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=320000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=640000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=1280000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=2560000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=5120000}, NULL) = 0 [pid 49563] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0 We have already a bugfix created which is already merged: https://github.com/golang/go/issues/71714 Is there a chance to get an ART build with the fix to evaluate if and how much it helps to reduce the OCP on s390x footprint?
Version-Release number of selected component (if applicable):
Any s390x OCP version
How reproducible:
Use strace to get the number and parameters of nanosleep used in kubelet
Steps to Reproduce:
1. Install any OCP on s390x version 2. Open debug container on a control node or ssh into a control node 3. Run: sudo timeout 10s strace -o tmp --trace=nanosleep -f -p $(ps ax | grep /usr/bin/kubelet | awk '{print $1}' | head -n1) 4. Have a look at tmp: - wc -l tmp # outputs the number of nanosleep calls - tmp contains wrongly defined tv_nsec of between 3-10000ns which is a factor of 1000x too less
Actual results:
Expected results:
Additional info:
- is cloned by
-
OCPBUGS-54874 Highly increased syscalls when using the usleep function in all go-based operators
-
- Closed
-
- links to
-
RHEA-2024:11038 OpenShift Container Platform 4.19.z bug fix update