Bug
Resolution: Done
Critical
4.9.z
None
None
CNF Compute Sprint 227, CNF Compute Sprint 228
2
Proposed
False
Description of problem:
We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one is created to pick up the revision. There is a 5 minute timer on the creation; if this timeout is exceeded, the rollout reverts to the previous revision.
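For reference, the per-node progress of a rollout can be watched via the operator status. A minimal sketch, assuming the kubeapiserver/cluster CR exposes currentRevision/targetRevision under status.nodeStatuses as in recent OpenShift releases:
# Show which revision each control-plane node is on vs. the target revision
oc get kubeapiserver/cluster \
  -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{": current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'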
The customer has been seeing failed rollouts because this 5 minute timer is exceeded. The platform CPUs are under load, with exec probe overhead being the biggest contributor, but there is still significant idle time (~50%).
While I was not able to reproduce the issue to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform CPU overhead.
From the logs, we see slow container creation times.
I added some instrumentation (the logger Start/Stop lines) to the low_latency_hooks.sh script:
[snip]
pid=$(jq '.pid' /dev/stdin 2>&1)
rc=$? # capture jq's exit status before the added logger call clobbers $?
logger "Start low_latency_hooks ${pid}"
[[ ${rc} -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
[snip]
if [ "${mode}" = "ro" ]; then
    ip netns exec "${ns}" mount -o remount,ro /sys
    [ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
fi
logger "Stop low_latency_hooks ${pid}"
Analysing the logs for the five running containers in the kube-apiserver pod, we see that the bulk of the time is being spent in the hook:
container                      total create time   hook time
insecure-readyz                35s                 29s
cert-syncer                    41s                 32s
cert-regeneration-controller   73s                 54s
kube-apiserver                 18s                 16s
check-endpoints                31s                 31s
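That is, the hook accounts for roughly 74-100% of each container's create time (e.g. 29/35 ≈ 83% for insecure-readyz, 54/73 ≈ 74% for cert-regeneration-controller, and 31/31 = 100% for check-endpoints).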
I ran another test where I removed the OCI hook and kept everything else the same; the results were dramatically different.
Container create times:
insecure-readyz - 1s
cert-syncer - 1s
cert-regeneration-controller - 1s
kube-apiserver - 1s
check-endpoints - 5s
I was then able to run the same test in the customer's lab. In some joint testing we did with the customer, we originally saw 4-5 minutes for a rollout. Without the hook, in the exact same environment, the total rollout time dropped to <=2 minutes.
Version-Release number of selected component (if applicable):
4.9.37
The issue is likely present in later releases as well, but those have not been timed yet.
How reproducible:
100%
Steps to Reproduce:
1. Force a rollout while there is a platform CPU load representative of the application (a sketch is included below).
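For step 1, a minimal sketch of forcing a kube-apiserver rollout, assuming the operator honors spec.forceRedeploymentReason on the kubeapiserver/cluster CR:
# Bump forceRedeploymentReason to trigger a new revision rollout
oc patch kubeapiserver/cluster --type merge \
  -p '{"spec":{"forceRedeploymentReason":"perf-test-'"$(date +%s)"'"}}'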
Actual results:
Slow rollout times, sometimes exceeding the 5 minute timeout.
Expected results:
The rollout should complete within the timeout window.
Additional info:
- depends on: OCPBUGS-4043 [2109965] oci hook Low-latency-hooks causing high container creation times under platform cpu load (Closed)
- is cloned by: OCPBUGS-4033 [2109965] [4.10] oci hook Low-latency-hooks causing high container creation times under platform cpu load (Closed)
- is depended on by: OCPBUGS-4033 [2109965] [4.10] oci hook Low-latency-hooks causing high container creation times under platform cpu load (Closed)
- links to: RHBA-2022:8698
Since the problem described in this issue should be resolved in a recent advisory, it has been closed.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2022:8698