OpenShift Virtualization / CNV-69854

[v4.19.z] Node drain with around 50 VMs is blocked forever because of virt-api client limits


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • CNV v4.19.z
    • CNV v4.17.0
    • CNV Virt-Cluster
    • None
    • CNV Virt-Cluster Sprint 278
    • Important
    • Customer Reported
    • None

      Description of problem:

      During the drain, kubectl concurrently tries to evict each pod running on the node. For VMs, the eviction is blocked by the PDB and is retried every 5 seconds, and since only 2 outgoing migrations are allowed by default, a large number of eviction calls are generated concurrently during the node drain. Each of these calls is intercepted by the webhook “virt-launcher-eviction-interceptor.kubevirt.io”, which calls virt-api.openshift-cnv.svc:443/launcher-eviction-validate. For each webhook call, virt-api calls the Kube API twice to get the VMI and the pod spec. When there are many VMs running on the node, this easily hits virt-api's client-side rate limit, which defaults to a QPS of 5 and a burst of 10, so the calls have to wait a long time to execute.
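      To illustrate the mechanism, here is a minimal Go sketch (not the actual virt-api code; the namespace and pod name are placeholders) of the client-go behaviour described above. When a rest.Config leaves QPS and Burst unset, client-go falls back to 5 requests per second with a burst of 10, and every request beyond that waits in a client-side queue, which is where the "Waited for ..." messages below come from.

      package main

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              panic(err)
          }
          cfg.QPS = 5    // client-go default when left unset
          cfg.Burst = 10 // client-go default when left unset
          client := kubernetes.NewForConfigOrDie(cfg)

          // Each eviction webhook call translates into two lookups like this
          // one (the VMI and the virt-launcher pod). With ~50 VMs retried
          // every 5 seconds, the shared limiter admits only 5 requests per
          // second, so calls queue up and the client-side throttling delay
          // keeps growing.
          pod, err := client.CoreV1().Pods("default").Get(
              context.TODO(), "virt-launcher-example", metav1.GetOptions{})
          if err != nil {
              fmt.Println("lookup failed:", err)
              return
          }
          fmt.Println("got pod:", pod.Name)
      }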

      I0121 20:17:22.042821       1 request.go:697] Waited for 3m33.071923903s due to client-side throttling, not priority and fairness, request: GET:https://172.31.0.1:443/api/v1/namespaces/default/pods/virt-launcher-testvm-146-jbw27

      Since the drain keeps going, requests keep piling up. For the customer where the problem was observed, the wait time even exceeded 2 hours:

      2025-01-13T23:59:50.899269452Z I0113 23:59:50.899220 1 request.go:697] Waited for 2h21m2.381288025s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/ns/pods/virt-launcher-test-vm-52-lbbkb

      Since the calls wait so long in the virt-api client queue, the webhook calls from the kube-apiserver fail with a timeout.

      W0121 20:13:03.890413       1 dispatcher.go:205] Failed calling webhook, failing open virt-launcher-eviction-interceptor.kubevirt.io: failed calling webhook "virt-launcher-eviction-interceptor.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/launcher-eviction-validate?timeout=10s": context deadline exceeded

      VMIM creation is also intercepted by the “migration-create-validator.kubevirt.io” webhook, which is likewise served by the virt-api service. This webhook also contacts the kube-apiserver to get the VMI spec, and that call also has to wait in the client queue.

      W0121 20:22:57.652968       1 dispatcher.go:217] Failed calling webhook, failing closed migration-create-validator.kubevirt.io: failed calling webhook "migration-create-validator.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/migration-validate-create?timeout=10s": context deadline exceeded

      So the evacuation does not progress because the VMIM never gets created, and the drain is stuck forever.
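      A hedged sketch of the webhook side of this failure (the handler shape and endpoint wiring are assumptions for illustration, not KubeVirt's actual implementation): the webhook's own lookups go through the same throttled client, but are bounded by the kube-apiserver's 10-second webhook timeout, which becomes the request context deadline.

      package main

      import (
          "fmt"
          "net/http"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      // Simplified stand-in for the eviction/migration validating-webhook
      // handlers: it performs a lookup through the same rate-limited client
      // before answering the admission review.
      func launcherEvictionValidate(client kubernetes.Interface) http.HandlerFunc {
          return func(w http.ResponseWriter, r *http.Request) {
              // The kube-apiserver calls the webhook with ?timeout=10s, so the
              // request context expires after 10 seconds. If the lookup spends
              // longer than that queued in the client-side rate limiter, it
              // fails with "context deadline exceeded" and no admission
              // response arrives in time; the webhook's failurePolicy then
              // decides whether the original request fails open (eviction) or
              // fails closed (VMIM creation).
              _, err := client.CoreV1().Pods("default").Get(
                  r.Context(), "virt-launcher-example", metav1.GetOptions{})
              if err != nil {
                  http.Error(w, fmt.Sprintf("lookup failed: %v", err),
                      http.StatusInternalServerError)
                  return
              }
              w.WriteHeader(http.StatusOK)
          }
      }

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)
          http.Handle("/launcher-eviction-validate", launcherEvictionValidate(client))
          // The real virt-api service terminates TLS on 443; plain HTTP here
          // only keeps the sketch short.
          panic(http.ListenAndServe(":8080", nil))
      }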

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization 4.17

      How reproducible:

      100%

      Steps to Reproduce:

      1. Spawn around 50 VMs on a node.
      2. Start the drain.
      3. Look at the virt-api logs. They will start showing "Waited for <duration> due to client-side throttling, not priority and fairness" messages, and the wait time will keep increasing.
      4. Look at the kube-apiserver logs and observe the webhook timeout errors.
      

      Actual results:

      Node drain with around 50 VMs is blocked forever because of virt-api client limits

      Expected results:

      The virt-api client rate-limit defaults appear too low to handle a drain of a node with many VMs. Also, the repeated webhook calls for eviction requests retried every 5 seconds look inefficient.
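      For illustration only (hypothetical values, and assuming the limits are ultimately plumbed through a client-go rest.Config; the actual tunable exposed by the product, if any, would need to be confirmed), raising the client-side limits amounts to something like:

      package main

      import (
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              panic(err)
          }
          // Hypothetical values for illustration; suitable numbers for a node
          // hosting ~50 VMs would need to be determined.
          cfg.QPS = 50
          cfg.Burst = 100
          _ = kubernetes.NewForConfigOrDie(cfg)
      }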
      

      Additional info:

       

              rhn-support-lyarwood Lee Yarwood
              rhn-support-nashok Nijin Ashok
              Kedar Bidarkar Kedar Bidarkar
