- Bug
- Resolution: Unresolved
- Critical
- CNV v4.17.0
- Incidents & Support
- CNV Virt-Cluster Sprint 278
- Important
- Customer Reported
Description of problem:
During the drain, kubectl concurrently tries to evict each pod running on the node. For VMs, the eviction is blocked by the PDB and is re-attempted after 5 seconds, and since only 2 outgoing migrations are allowed by default, a lot of eviction calls are generated concurrently during the node drain. Each of these calls is intercepted by the webhook "virt-launcher-eviction-interceptor.kubevirt.io", which calls virt-api.openshift-cnv.svc:443/launcher-eviction-validate. For each webhook call, virt-api calls the Kube API twice, to get the VMI and the pod spec. When there are many VMs running on the node, this can easily hit the client rate limit of virt-api, which by default is a QPS of 5 and a burst of 10, so the calls have to wait a long time before they execute.
I0121 20:17:22.042821 1 request.go:697] Waited for 3m33.071923903s due to client-side throttling, not priority and fairness, request: GET:https://172.31.0.1:443/api/v1/namespaces/default/pods/virt-launcher-testvm-146-jbw27
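For reference, this throttling comes from the client-side rate limiter that client-go attaches to the rest.Config used by the client. A minimal sketch of where QPS and burst live (the values are the client-go defaults; this is only an illustration, not the actual virt-api code):

package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from the default kubeconfig; virt-api itself uses
	// the in-cluster config, this is only to show where the limits are set.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go falls back to QPS=5 and Burst=10 when these fields are left
	// unset. Requests issued above that rate are delayed by the client-side
	// rate limiter, which produces the "Waited for ... due to client-side
	// throttling" messages quoted above.
	config.QPS = rest.DefaultQPS     // 5
	config.Burst = rest.DefaultBurst // 10

	fmt.Printf("client rate limits: QPS=%v, Burst=%v\n", config.QPS, config.Burst)
}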
Since the drain keeps going, the requests keep piling up. For the customer where we observed the problem, the wait even exceeded 2 hours:
2025-01-13T23:59:50.899269452Z I0113 23:59:50.899220 1 request.go:697] Waited for 2h21m2.381288025s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/ns/pods/virt-launcher-test-vm-52-lbbkb
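As a rough back-of-envelope illustration of why the backlog only grows (all figures below are assumptions taken from this description, not measurements from the affected cluster):

package main

import "fmt"

func main() {
	// Assumed figures, see the description above; these are illustrative,
	// not measured values.
	const (
		pendingVMs        = 50  // VMs on the node still waiting to be evicted
		retryIntervalSecs = 5.0 // each blocked eviction is re-attempted every ~5 s
		getsPerWebhook    = 2   // virt-api fetches the VMI and the pod per eviction webhook call
		defaultQPS        = 5.0 // client-go default sustained rate
	)

	// Eviction attempts per second while the drain is blocked.
	evictionsPerSec := pendingVMs / retryIntervalSecs
	// GET requests virt-api issues per second as a result.
	getsPerSec := evictionsPerSec * getsPerWebhook

	fmt.Printf("~%.0f eviction calls/s -> ~%.0f GETs/s against a %.0f QPS limit\n",
		evictionsPerSec, getsPerSec, defaultQPS)
	fmt.Printf("queue grows by ~%.0f requests/s, so wait times only increase\n",
		getsPerSec-defaultQPS)
}

Under these assumptions each retry cycle adds more queued GETs than the limiter can drain, which matches the wait times in the logs climbing from minutes to hours.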
Since the calls wait for a long time in the virt-api client queue, the webhook calls from the kube-apiserver fail with a timeout.
W0121 20:13:03.890413 1 dispatcher.go:205] Failed calling webhook, failing open virt-launcher-eviction-interceptor.kubevirt.io: failed calling webhook "virt-launcher-eviction-interceptor.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/launcher-eviction-validate?timeout=10s": context deadline exceeded
VMIM creation is also intercepted by the webhook "migration-create-validator.kubevirt.io", which is also served by the virt-api service. This webhook also contacts the kube-apiserver to get the VMI spec, and that request has to wait in the same client queue.
W0121 20:22:57.652968 1 dispatcher.go:217] Failed calling webhook, failing closed migration-create-validator.kubevirt.io: failed calling webhook "migration-create-validator.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/migration-validate-create?timeout=10s": context deadline exceeded
So the evacuation does not progress because the VMIM never gets created, and the drain is stuck forever.
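The different outcomes of the two webhooks match their failure policies: the eviction interceptor logs "failing open" (a timeout lets the request through), while the migration validator logs "failing closed" (a timeout rejects the VMIM creation). A minimal sketch of how those two policies are expressed in an admissionregistration/v1 webhook definition; the policy values are inferred from the log messages above, not copied from the actual KubeVirt manifests:

package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
)

func main() {
	ignore := admissionregistrationv1.Ignore // "failing open": a webhook timeout lets the request through
	fail := admissionregistrationv1.Fail     // "failing closed": a webhook timeout rejects the request

	webhooks := []admissionregistrationv1.ValidatingWebhook{
		{
			Name:          "virt-launcher-eviction-interceptor.kubevirt.io",
			FailurePolicy: &ignore,
		},
		{
			Name:          "migration-create-validator.kubevirt.io",
			FailurePolicy: &fail,
		},
	}

	for _, wh := range webhooks {
		fmt.Printf("%s -> failurePolicy=%s\n", wh.Name, *wh.FailurePolicy)
	}
}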
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.17
How reproducible:
100%
Steps to Reproduce:
1. Spawn around 50 VMs on a node.
2. Start the drain.
3. Look at the virt-api logs. They will start showing the error "Waited for ...s due to client-side throttling, not priority and fairness", and the waiting time will keep increasing.
4. Look at the kube-apiserver logs and observe webhook timeout errors.
Actual results:
Node drain with around 50 VMs is blocked forever because of virt-api client limits
Expected results:
I think the virt-api client rate-limit defaults are too low and cannot handle a drain of a node with many VMs. Also, the repeated webhook calls for eviction requests every 5 seconds look inefficient.
Additional info:
- clones: CNV-69853 [v4.20.z] Node drain with around 50 VMs is blocked forever because of virt-api client limits (POST)
- is cloned by: CNV-69855 [v4.18.z] Node drain with around 50 VMs is blocked forever because of virt-api client limits (POST)
- split from: CNV-69499 Increase default client rate limits for all virt components (New)