Type: Bug
Resolution: Duplicate
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.9
Component/s: Pod Autoscaler
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
None
Blocked Reason:
None
Story Points:
3
Severity:
Moderate
Regression:
None
Architecture:

All

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:

Release Note Status:
None
Release Note Type:
If docs needed, set a value
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

In larger clusters, we are seeing the VPA webhook admission time increase. The timeout changes in 4.9 help keep pod creation from failing, but client-side throttling is still slowing everything down.

Graphing `sum(rate(apiserver_admission_webhook_admission_duration_seconds_count{operation="CREATE",rejected="false"}[1m])) by (name)`, we can see that VPA admission durations spike to between 11 and 31 seconds.
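The `_count` series tracks how many admissions were observed, so one way to plot average latency directly is to divide the rate of the `_sum` series by the rate of the `_count` series. The following is a minimal sketch that only builds the PromQL string. It assumes the metric is exported as a histogram with `_sum` and `_count` series, and the helper name `avg_webhook_latency_query` is hypothetical.

```python
def avg_webhook_latency_query(window="1m", operation="CREATE", rejected="false"):
    """Build a PromQL expression for average webhook admission latency per webhook name.

    Assumes the metric is a histogram, so `_sum` and `_count` series exist.
    """
    metric = "apiserver_admission_webhook_admission_duration_seconds"
    selector = f'{{operation="{operation}",rejected="{rejected}"}}'
    # Seconds spent in admission per second, grouped by webhook name.
    total = f"sum(rate({metric}_sum{selector}[{window}])) by (name)"
    # Admissions per second, grouped by webhook name.
    count = f"sum(rate({metric}_count{selector}[{window}])) by (name)"
    # The ratio is the average seconds per admission over the window.
    return f"{total} / {count}"


if __name__ == "__main__":
    print(avg_webhook_latency_query())
```

The resulting expression can be pasted into the same graphing view as the query above.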

Logs from both the vpa-admission-plugin and the vpa-updater pods show significant throttling:

2022-05-26T06:09:47.567143786Z I0526 06:09:47.567078 1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:k8s.io/autoscaler/vertical-pod-autoscaler/pkg/target/fetcher.go:95 (started: 2022-05-26 06:09:37.002034365 +0000 UTC m=+1.813803928) (total time: 10.564999592s):
2022-05-26T06:09:47.567143786Z Trace[911902081]: [10.54507419s] [10.54507419s] Objects listed
2022-05-26T06:09:47.603041102Z I0526 06:09:47.602902 1 fetcher.go:100] Initial sync of ReplicationController completed
2022-05-26T06:09:48.104000169Z I0526 06:09:48.103942 1 fetcher.go:100] Initial sync of Job completed
2022-05-26T06:09:48.206281230Z I0526 06:09:48.204097 1 fetcher.go:100] Initial sync of CronJob completed
2022-05-26T06:09:48.304442123Z I0526 06:09:48.304371 1 fetcher.go:100] Initial sync of DaemonSet completed
2022-05-26T06:09:58.891853970Z I0526 06:09:58.891786 1 trace.go:116] Trace[607811211]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2022-05-26 06:09:48.304811597 +0000 UTC m=+13.116581178) (total time: 10.586932694s):
2022-05-26T06:09:58.891853970Z Trace[607811211]: [10.561769516s] [10.561769516s] Objects listed
2022-05-26T06:45:59.185521417Z I0526 06:45:59.185414 1 request.go:621] Throttling request took 1.198226163s, request: GET:https://10.98.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s
2022-05-26T06:46:09.385012980Z I0526 06:46:09.384965 1 request.go:621] Throttling request took 11.397666118s, request: GET:https://10.98.0.1:443/apis/apiserver.openshift.io/v1?timeout=32s
2022-05-26T06:46:19.584660745Z I0526 06:46:19.584606 1 request.go:621] Throttling request took 21.597286629s, request: GET:https://10.98.0.1:443/apis/operators.coreos.com/v2?timeout=32s
2022-05-26T11:49:39.261982690Z I0526 11:49:39.261923 1 request.go:621] Throttling request took 1.198861744s, request: GET:https://10.98.0.1:443/apis/authentication.k8s.io/v1?timeout=32s
2022-05-26T11:49:49.461548564Z I0526 11:49:49.461475 1 request.go:621] Throttling request took 11.39831947s, request: GET:https://10.98.0.1:443/apis/project.openshift.io/v1?timeout=32s
2022-05-26T11:49:59.661101936Z I0526 11:49:59.661037 1 request.go:621] Throttling request took 21.597846712s, request: GET:https://10.98.0.1:443/apis/utils.devops.gov.bc.ca/v1?timeout=32s
2022-05-26T12:00:01.798488449Z I0526 12:00:01.798404 1 request.go:621] Throttling request took 1.198039866s, request: GET:https://10.98.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
2022-05-26T12:00:11.799177479Z I0526 12:00:11.799070 1 request.go:621] Throttling request took 11.198579542s, request: GET:https://10.98.0.1:443/apis/integreatly.org/v1alpha1?timeout=32s
2022-05-26T12:00:21.997400272Z I0526 12:00:21.997333 1 request.go:621] Throttling request took 21.396652936s, request: GET:https://10.98.0.1:443/apis/imageregistry.operator.openshift.io/v1?timeout=32s
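To summarize throttling across a longer log, the wait times can be extracted with a small script. This is a sketch rather than a supported tool: the helper name `parse_throttling` is hypothetical, and the pattern assumes durations are printed in plain seconds (for example `21.597286629s`), as in the lines above. Durations printed in other units are skipped.

```python
import re

# Matches e.g. "Throttling request took 1.198226163s, request: GET:https://..."
# Assumption: the duration is printed in plain seconds.
THROTTLE_RE = re.compile(
    r"Throttling request took (\d+(?:\.\d+)?)s, request: (\w+):(\S+)"
)


def parse_throttling(lines):
    """Return a list of (seconds, verb, url) for each throttling log line."""
    results = []
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            results.append((float(match.group(1)), match.group(2), match.group(3)))
    return results


SAMPLE = [
    "I0526 06:45:59.185414 1 request.go:621] Throttling request took 1.198226163s, "
    "request: GET:https://10.98.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s",
    "I0526 06:46:09.384965 1 request.go:621] Throttling request took 11.397666118s, "
    "request: GET:https://10.98.0.1:443/apis/apiserver.openshift.io/v1?timeout=32s",
    "I0526 06:46:19.584606 1 request.go:621] Throttling request took 21.597286629s, "
    "request: GET:https://10.98.0.1:443/apis/operators.coreos.com/v2?timeout=32s",
    "I0526 06:09:47.602902 1 fetcher.go:100] Initial sync of ReplicationController completed",
]

if __name__ == "__main__":
    entries = parse_throttling(SAMPLE)
    worst = max(entries, key=lambda entry: entry[0])
    print(f"{len(entries)} throttled requests, worst wait {worst[0]:.1f}s on {worst[2]}")
```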

Right now VPA seems to use the default 5/10 for QPS/burst, which could be the limiting factor in larger clusters.
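As a rough illustration of why 5 QPS with a burst of 10 can add up to waits of this size, the sketch below models a simple token bucket: the first `burst` requests go out immediately and every later request waits for the next token. It is a simplified model, not the client's actual rate limiter code, and the figure of 120 requests issued at once is an assumption chosen for illustration, not a number taken from this cluster.

```python
def throttle_delays(num_requests, qps=5.0, burst=10):
    """Return the client-side wait in seconds for each request.

    Simplified token-bucket model: all requests are issued at t=0, the bucket
    starts full with `burst` tokens and refills at `qps` tokens per second.
    """
    delays = []
    for i in range(num_requests):
        if i < burst:
            # Served from the initial burst allowance.
            delays.append(0.0)
        else:
            # The (i - burst + 1)-th refilled token arrives after this long.
            delays.append((i - burst + 1) / qps)
    return delays


if __name__ == "__main__":
    # Assumption: 120 requests queued at once (illustrative only).
    delays = throttle_delays(120)
    print(f"last request waits {delays[-1]:.1f}s")
```

Under this model, a larger number of queued requests lengthens the tail wait linearly, while raising `qps` and `burst` shortens it.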

Version-Release number of selected component (if applicable):
4.8/4.9

How reproducible:
Pretty reliable in larger clusters

Steps to Reproduce:
1.
2.
3.

Actual results:
VPA admission is quite slow and, in versions before the 4.9 timeout changes, could cause pod startup to fail.

Expected results:
Admission should be much faster

Additional info:

Assignee: Joel Smith

Reporter: Matt Robson

Need Info From: None

Contributors: None

QA Contact: Sunil Choudhary

Doc Contact: None

Contributing Groups: Red Hat Employee

Votes: 0

Watchers: 6

Created: 2022/06/02 6:39 PM

Updated: 2025/09/13 5:23 PM

Resolved: 2023/11/27 7:47 PM
