  1. OpenShift Bugs
  2. OCPBUGS-9300

VPA vpa-admission-plugin and vpa-updater client throttling causes slow webhook admission



      Description of problem:

      In larger clusters, we are seeing the VPA webhook admission time increase. The timeout changes in 4.9 help keep pod creation from failing, but client-side throttling is still slowing everything down.

      Graphing `sum(rate(apiserver_admission_webhook_admission_duration_seconds_count{operation="CREATE",rejected="false"}[1m])) by (name)`, we can see that VPA admission durations spike to between 11 and 31 seconds.
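
The `_count` series in the query above counts admission calls. Since the metric is a histogram, a mean duration is usually derived by dividing the increase of the matching `_sum` series by the increase of `_count` over the same window. The sketch below only illustrates that arithmetic; the function name and the counter samples are made up for the example and are not taken from the affected cluster, and counter resets are ignored.

```python
def mean_duration(sum_start, sum_end, count_start, count_end):
    """Mean seconds per admission call over a window, from histogram counters.

    Mirrors rate(..._sum[w]) / rate(..._count[w]): the window length cancels,
    so only the counter deltas matter.
    """
    calls = count_end - count_start
    if calls <= 0:
        return None  # no calls in the window, so there is no mean to report
    return (sum_end - sum_start) / calls


# Hypothetical counter samples taken one minute apart.
print(mean_duration(sum_start=120.0, sum_end=450.0, count_start=40, count_end=70))
```

With these invented samples, 330 seconds of admission time spread over 30 calls gives a mean of 11 seconds per call.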

      Logs from both the vpa-admission-plugin and the vpa-updater pods show significant throttling:

      2022-05-26T06:09:47.567143786Z I0526 06:09:47.567078 1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:k8s.io/autoscaler/vertical-pod-autoscaler/pkg/target/fetcher.go:95 (started: 2022-05-26 06:09:37.002034365 +0000 UTC m=+1.813803928) (total time: 10.564999592s):
      2022-05-26T06:09:47.567143786Z Trace[911902081]: [10.54507419s] [10.54507419s] Objects listed
      2022-05-26T06:09:47.603041102Z I0526 06:09:47.602902 1 fetcher.go:100] Initial sync of ReplicationController completed
      2022-05-26T06:09:48.104000169Z I0526 06:09:48.103942 1 fetcher.go:100] Initial sync of Job completed
      2022-05-26T06:09:48.206281230Z I0526 06:09:48.204097 1 fetcher.go:100] Initial sync of CronJob completed
      2022-05-26T06:09:48.304442123Z I0526 06:09:48.304371 1 fetcher.go:100] Initial sync of DaemonSet completed
      2022-05-26T06:09:58.891853970Z I0526 06:09:58.891786 1 trace.go:116] Trace[607811211]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:135 (started: 2022-05-26 06:09:48.304811597 +0000 UTC m=+13.116581178) (total time: 10.586932694s):
      2022-05-26T06:09:58.891853970Z Trace[607811211]: [10.561769516s] [10.561769516s] Objects listed
      2022-05-26T06:45:59.185521417Z I0526 06:45:59.185414 1 request.go:621] Throttling request took 1.198226163s, request: GET:https://10.98.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s
      2022-05-26T06:46:09.385012980Z I0526 06:46:09.384965 1 request.go:621] Throttling request took 11.397666118s, request: GET:https://10.98.0.1:443/apis/apiserver.openshift.io/v1?timeout=32s
      2022-05-26T06:46:19.584660745Z I0526 06:46:19.584606 1 request.go:621] Throttling request took 21.597286629s, request: GET:https://10.98.0.1:443/apis/operators.coreos.com/v2?timeout=32s
      2022-05-26T11:49:39.261982690Z I0526 11:49:39.261923 1 request.go:621] Throttling request took 1.198861744s, request: GET:https://10.98.0.1:443/apis/authentication.k8s.io/v1?timeout=32s
      2022-05-26T11:49:49.461548564Z I0526 11:49:49.461475 1 request.go:621] Throttling request took 11.39831947s, request: GET:https://10.98.0.1:443/apis/project.openshift.io/v1?timeout=32s
      2022-05-26T11:49:59.661101936Z I0526 11:49:59.661037 1 request.go:621] Throttling request took 21.597846712s, request: GET:https://10.98.0.1:443/apis/utils.devops.gov.bc.ca/v1?timeout=32s
      2022-05-26T12:00:01.798488449Z I0526 12:00:01.798404 1 request.go:621] Throttling request took 1.198039866s, request: GET:https://10.98.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
      2022-05-26T12:00:11.799177479Z I0526 12:00:11.799070 1 request.go:621] Throttling request took 11.198579542s, request: GET:https://10.98.0.1:443/apis/integreatly.org/v1alpha1?timeout=32s
      2022-05-26T12:00:21.997400272Z I0526 12:00:21.997333 1 request.go:621] Throttling request took 21.396652936s, request: GET:https://10.98.0.1:443/apis/imageregistry.operator.openshift.io/v1?timeout=32s
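
To summarise an excerpt like the one above, the throttling waits can be pulled out of the pod logs with a small script. This is a minimal sketch: the regular expression is written against the message format visible in these lines, the function name is hypothetical, and waits logged in other units (for example milliseconds) are deliberately skipped.

```python
import re

# Matches the "Throttling request took ...s" message format seen in the excerpt.
THROTTLE_RE = re.compile(
    r"Throttling request took (?P<seconds>\d+(?:\.\d+)?)s, "
    r"request: (?P<verb>[A-Z]+):(?P<url>\S+)"
)


def parse_throttle_lines(lines):
    """Return (seconds, verb, url) for every throttling message in lines."""
    results = []
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            results.append(
                (float(match.group("seconds")), match.group("verb"), match.group("url"))
            )
    return results


# Two throttling lines and one unrelated line, copied from the excerpt
# (leading container timestamps dropped).
sample = [
    "I0526 06:45:59.185414 1 request.go:621] Throttling request took 1.198226163s, "
    "request: GET:https://10.98.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s",
    "I0526 06:46:19.584606 1 request.go:621] Throttling request took 21.597286629s, "
    "request: GET:https://10.98.0.1:443/apis/operators.coreos.com/v2?timeout=32s",
    "I0526 06:09:47.602902 1 fetcher.go:100] Initial sync of ReplicationController completed",
]

waits = parse_throttle_lines(sample)
print(max(seconds for seconds, _, _ in waits))
```

Running it over a full pod log makes it easy to see how often the longest waits occur and which API group requests are delayed.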

      Right now VPA seems to use the default client settings of 5 QPS with a burst of 10, which could be the limiting factor in larger clusters.
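
To see why 5 QPS with a burst of 10 could produce waits of this size, here is a rough model of a token-bucket limiter. It assumes the bucket starts full and that all requests are queued at the same instant (for example one GET per API group); both are simplifications, and the function names are hypothetical. The alternative values of 50 QPS and a burst of 100 are only an illustration, not a recommended setting.

```python
import math


def queue_wait(position, qps=5.0, burst=10):
    """Seconds the request at 1-based `position` waits when all are issued at once."""
    if position <= burst:
        return 0.0  # served immediately from the initial burst of tokens
    return (position - burst) / qps  # one new token every 1/qps seconds


def position_for_wait(wait_seconds, qps=5.0, burst=10):
    """Approximate queue position that would see `wait_seconds` of throttling."""
    return burst + math.ceil(wait_seconds * qps)


# The longest wait in the log excerpt is 21.597286629s.
position = position_for_wait(21.597286629)
print(position, queue_wait(position))

# The same queue position under hypothetical, higher limits.
print(queue_wait(position, qps=50.0, burst=100))
```

Under this model a wait of about 21.6 seconds corresponds to roughly the 118th request in the queue, which is plausible on a cluster that serves many API groups. With the hypothetical higher limits, the same request would wait well under a second.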

      Version-Release number of selected component (if applicable):
      4.8/4.9

      How reproducible:
      Pretty reliable in larger clusters

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:
      VPA admission is quite slow and, before the 4.9 timeout changes, could cause pod startup to fail.

      Expected results:
      Admission should be much faster

      Additional info:

              joelsmith.redhat Joel Smith
              rhn-support-mrobson Matt Robson
              None
              None
              Sunil Choudhary Sunil Choudhary
              None
              Red Hat Employee