- Bug
- Resolution: Done
- Critical
- 4.9.z
- None
- None
- CNF Compute Sprint 227, CNF Compute Sprint 228
- 2
- Proposed
- False
Description of problem:
We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one is created to pick up the revision. There is a 5-minute timer on the creation; if this timeout is exceeded, the rollout reverts to the previous revision.
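For reference, a rough way to watch a rollout progress (and spot a revert) is the kubeapiserver operator's revision status; a minimal sketch:

# Latest revision the operator wants the nodes to run.
oc get kubeapiserver cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'

# Per-node current vs. target revision; a node stuck behind its target
# (or dropping back) indicates a slow or reverted rollout.
oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'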
The customer has been seeing failed rollouts because this 5-minute timer is exceeded. There is load on the platform CPUs, with the biggest contributor being exec probe overhead, but there is still significant idle time (~50%).
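A quick way to characterise that platform-CPU load is per-CPU utilisation on the node; a minimal sketch, assuming the sysstat tools are installed and that CPUs 0-1 are the reserved/platform CPUs (adjust to the cluster's performance profile):

# Per-CPU utilisation for the platform (reserved) CPUs, sampled every 5s.
# CPU list 0,1 is only an example; use the reserved set from the performance profile.
mpstat -P 0,1 5 12

# Top consumers on the node, to confirm what is driving the platform CPU load.
top -b -n 1 -o %CPU | head -n 25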
While I was not able to reproduce the issue to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform CPU overhead.
From the logs, we see slow container creation times.
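As a rough cross-check, the kubelet events for the static pod also give a coarse per-container timeline; a sketch:

# Coarse per-container timeline from kubelet events in the kube-apiserver
# namespace; long gaps around the Created/Started events line up with the
# slow container creation seen in the node logs.
oc get events -n openshift-kube-apiserver \
  --sort-by=.lastTimestamp \
  | grep -E 'Pulled|Created|Started'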
I added some instrumentation to the low_latency_hooks.sh script:
snip
pid=$(jq '.pid' /dev/stdin 2>&1); rc=$? # capture jq's exit code before logging overwrites $?
logger "Start low_latency_hooks ${pid}"
[[ ${rc} -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
snip
if [ "${mode}" = "ro" ]; then
ip netns exec "${ns}" mount -o remount,ro /sys
[ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
fi
logger "Stop low_latency_hooks ${pid}"
Analysing the logs for the five running containers in the kube-apiserver pod, we see that the bulk of the time is being spent in the hook.
Container                      Total container create time   Hook time
insecure-readyz                35s                           29s
cert-syncer                    41s                           32s
cert-regeneration-controller   73s                           54s
kube-apiserver                 18s                           16s
check-endpoints                31s                           31s
I ran another test where I removed the OCI hook and kept everything else the same; the results were dramatically different.
Container create times:
insecure-readyz - 1s
cert-syncer - 1s
cert-regeneration-controller - 1s
kube-apiserver - 1s
check-endpoints - 5s
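For the no-hook run, the hook only needs to be hidden from CRI-O; a sketch of one way to do that, assuming the hook is configured through the standard OCI hooks.d directories (the config filename below is a guess, check the listing first):

# Find the hook configuration; CRI-O reads OCI hook configs from these dirs.
ls /etc/containers/oci/hooks.d/ /usr/share/containers/oci/hooks.d/

# Move the matching config aside (filename is hypothetical, use the one found
# above) and restart CRI-O so new containers are created without the hook.
mv /etc/containers/oci/hooks.d/99-low-latency-hooks.json /root/
systemctl restart crio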
I was then able to run the same test in the customer's lab. In some joint testing we did with the customer, we originally saw 4-5 minutes for a rollout. Without the hook, in the exact same environment, the total rollout time dropped to <=2 minutes.
Version-Release number of selected component (if applicable):
4.9.37
The issue is likely present in later releases as well; these have not been timed yet.
How reproducible:
100%
Steps to Reproduce:
1. Force a rollout with a platform CPU load representative of the application (see the sketch below)
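A sketch of how step 1 can be driven end to end, assuming stress-ng is available on the node for the platform-CPU load and using the operator's forceRedeploymentReason to trigger a new revision (the CPU list is an example, match the cluster's reserved CPUs):

# On the node: load the platform (reserved) CPUs for the duration of the test.
taskset -c 0,1 stress-ng --cpu 2 --timeout 30m &

# From a client: force a new kube-apiserver revision, which rolls the static pods.
oc patch kubeapiserver cluster --type merge \
  -p "{\"spec\":{\"forceRedeploymentReason\":\"perf-test-$(date +%s)\"}}"

# Re-run periodically and time how long each node takes to reach the new revision.
oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{" -> "}{.targetRevision}{"\n"}{end}'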
Actual results:
Slow rollout times, sometimes exceeding the timeout
Expected results:
The rollout should fit within the timeout window
Additional info:
- depends on
  OCPBUGS-4043 [2109965] oci hook Low-latency-hooks causing high container creation times under platform cpu load - Closed
- is cloned by
  OCPBUGS-4033 [2109965] [4.10] oci hook Low-latency-hooks causing high container creation times under platform cpu load - Closed
- is depended on by
  OCPBUGS-4033 [2109965] [4.10] oci hook Low-latency-hooks causing high container creation times under platform cpu load - Closed
- links to