[OCPBUGS-3182] [2109965] [4.11] oci hook Low-latency-hooks causing high container creation times under platform cpu load

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.11.z
Affects Version/s: 4.9.z
Component/s: Node Tuning Operator, Performance Addon Operator
Labels:
None

Regression:
None
Sprint:
CNF Compute Sprint 227, CNF Compute Sprint 228
sprint_count:
2
Release Blocker:
Proposed
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Release Note Text:

Hide
previously, the rollout times for kube-apiserver, on loaded cluster, were slow and sometimes exceeded the 5 min rollout timeout.
With this fix, the rollout times should be shorter and within the 5 min threshold.

Show
previously, the rollout times for kube-apiserver, on loaded cluster, were slow and sometimes exceeded the 5 min rollout timeout. With this fix, the rollout times should be shorter and within the 5 min threshold.
Target Version:

4.11.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one created to pick up the revision. There is a 5 min timer on the creation, if this timeout is exceeded the rollout will revert to the previous revision.

The customer has been seeing failed rollouts due to this 5 min timer being exceeded. There is load on the platform cpus with the biggest contributor being exec probe overhead, but there is still significant idle ~ 50%.

While not able to reproduce to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform cpu overhead.

From the logs, we see slow container creation times.

I added some instrumentation to the low_latency_hooks.sh script

snip
pid=$(jq '.pid' /dev/stdin 2>&1)
logger "Start low_latency_hooks ${pid}"
[[ $? -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
snip
if [ "${mode}" = "ro" ]; then
ip netns exec "${ns}" mount -o remount,ro /sys
[ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
fi
logger "Stop low_latency_hooks ${pid}"

Analysing the logs for the five running containers in the apiserver we see that the bulk of the time is being spent in the hook.

insecure-readyz
total container create time: 35s
hook time: 29s

cert-syncer
total container create time: 41s
hook time: 32s

cert-regeneration-controller
total container create time: 73s
hook time: 54s

kube-apiserver
total container create time: 18s
hook time: 16s

check-endpoints
total container create time: 31s
hook time: 31s

I ran another test where I removed the oci hook and kept everything else the same, the results were dramatically different.

Container create times:
insecure-readyz - 1s
cert-syncer - 1s
cert-regeneration-controller - 1s
kube-apiserver -1s
check-endpoints - 5s

I was then able to run the same test in the customers lab. In some joint testing we did with the customer we originally saw 4-5 mins for a rollout. Without the hook in the exact same environment, the total rollout time dropped to <=2 mins.

Version-Release number of selected component (if applicable):
4.9.37
Issue likely in later releases as well, have not timed yet

How reproducible:
100%

Steps to Reproduce:
1. Force a rollout with a platform cpu load representative of the application
2.
3.

Actual results:
Slow rollout times sometimes exceeding the timeout

Expected results:
Rollout should fit into the timeout window

Additional info:

depends on

OCPBUGS-4043 [2109965] oci hook Low-latency-hooks causing high container creation times under platform cpu load

Closed

is cloned by

OCPBUGS-4033 [2109965] [4.10] oci hook Low-latency-hooks causing high container creation times under platform cpu load

Closed

is depended on by

OCPBUGS-4033 [2109965] [4.10] oci hook Low-latency-hooks causing high container creation times under platform cpu load

Closed

links to

openshift/cluster-node-tuning-operator#508: [release-4.11] [manual] OCPBUGS-3182: set RPS for veth on host level only

Errata Tool added a comment - 2022/12/05 7:16 AM

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2022:8698

Errata Tool added a comment - 2022/12/05 7:16 AM Since the problem described in this issue should be resolved in a recent advisory, it has been closed. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8698

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

Collapse comment: Errata Tool added a comment - 2022/12/05 7:16 AM

Expand comment: Errata Tool added a comment - 2022/12/05 7:16 AM

People

Dates