- Bug
- Resolution: Unresolved
- Critical
- CNV v4.17.0
- Incidents & Support
- CNV Virt-Cluster Sprint 278
- Important
- Customer Reported
Description of problem:
During the drain, kubectl concurrently tries to evict each pod running on the node. For VMs, the eviction is blocked by the PDB and is re-attempted after 5 seconds, and since only 2 outgoing migrations are allowed by default, a lot of eviction calls are generated concurrently during the node drain. Each of these calls is intercepted by the webhook "virt-launcher-eviction-interceptor.kubevirt.io", which calls virt-api.openshift-cnv.svc:443/launcher-eviction-validate. For each webhook call, virt-api calls the Kube API twice, to get the VMI and the pod spec. When there are many VMs running on the node, this can easily hit the client rate limit of virt-api, which by default is a QPS of 5 and a burst of 10, so the calls have to wait a long time before they execute.
I0121 20:17:22.042821 1 request.go:697] Waited for 3m33.071923903s due to client-side throttling, not priority and fairness, request: GET:https://172.31.0.1:443/api/v1/namespaces/default/pods/virt-launcher-testvm-146-jbw27
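For reference, this throttling comes from the client-side rate limiter that client-go attaches to the rest.Config used by the client. A minimal sketch of where QPS and burst live (the values are the client-go defaults; this is only an illustration, not the actual virt-api code):

package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from the default kubeconfig; virt-api itself uses
	// the in-cluster config, this is only to show where the limits are set.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go falls back to QPS=5 and Burst=10 when these fields are left
	// unset. Requests issued above that rate are delayed by the client-side
	// rate limiter, which produces the "Waited for ... due to client-side
	// throttling" messages quoted above.
	config.QPS = rest.DefaultQPS     // 5
	config.Burst = rest.DefaultBurst // 10

	fmt.Printf("client rate limits: QPS=%v, Burst=%v\n", config.QPS, config.Burst)
}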
Since the drain keeps going, the requests keep piling up. For the customer where we observed the problem, the wait even exceeded 2 hours:
2025-01-13T23:59:50.899269452Z I0113 23:59:50.899220 1 request.go:697] Waited for 2h21m2.381288025s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/ns/pods/virt-launcher-test-vm-52-lbbkb
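As a rough back-of-envelope illustration of why the backlog only grows (all figures below are assumptions taken from this description, not measurements from the affected cluster):

package main

import "fmt"

func main() {
	// Assumed figures, see the description above; these are illustrative,
	// not measured values.
	const (
		pendingVMs        = 50  // VMs on the node still waiting to be evicted
		retryIntervalSecs = 5.0 // each blocked eviction is re-attempted every ~5 s
		getsPerWebhook    = 2   // virt-api fetches the VMI and the pod per eviction webhook call
		defaultQPS        = 5.0 // client-go default sustained rate
	)

	// Eviction attempts per second while the drain is blocked.
	evictionsPerSec := pendingVMs / retryIntervalSecs
	// GET requests virt-api issues per second as a result.
	getsPerSec := evictionsPerSec * getsPerWebhook

	fmt.Printf("~%.0f eviction calls/s -> ~%.0f GETs/s against a %.0f QPS limit\n",
		evictionsPerSec, getsPerSec, defaultQPS)
	fmt.Printf("queue grows by ~%.0f requests/s, so wait times only increase\n",
		getsPerSec-defaultQPS)
}

Under these assumptions each retry cycle adds more queued GETs than the limiter can drain, which matches the wait times in the logs climbing from minutes to hours.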
Since the calls wait for a long time in the virt-api client queue, the webhook calls from the kube-apiserver fail with a timeout.
W0121 20:13:03.890413 1 dispatcher.go:205] Failed calling webhook, failing open virt-launcher-eviction-interceptor.kubevirt.io: failed calling webhook "virt-launcher-eviction-interceptor.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/launcher-eviction-validate?timeout=10s": context deadline exceeded
VMIM creation is also intercepted by the webhook "migration-create-validator.kubevirt.io", which is also served by the virt-api service. This webhook also contacts the kube-apiserver to get the VMI spec, and that request has to wait in the same client queue.
W0121 20:22:57.652968 1 dispatcher.go:217] Failed calling webhook, failing closed migration-create-validator.kubevirt.io: failed calling webhook "migration-create-validator.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/migration-validate-create?timeout=10s": context deadline exceeded
So the evacuation does not progress because the VMIM never gets created, and the drain is stuck forever.
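The different outcomes of the two webhooks match their failure policies: the eviction interceptor logs "failing open" (a timeout lets the request through), while the migration validator logs "failing closed" (a timeout rejects the VMIM creation). A minimal sketch of how those two policies are expressed in an admissionregistration/v1 webhook definition; the policy values are inferred from the log messages above, not copied from the actual KubeVirt manifests:

package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
)

func main() {
	ignore := admissionregistrationv1.Ignore // "failing open": a webhook timeout lets the request through
	fail := admissionregistrationv1.Fail     // "failing closed": a webhook timeout rejects the request

	webhooks := []admissionregistrationv1.ValidatingWebhook{
		{
			Name:          "virt-launcher-eviction-interceptor.kubevirt.io",
			FailurePolicy: &ignore,
		},
		{
			Name:          "migration-create-validator.kubevirt.io",
			FailurePolicy: &fail,
		},
	}

	for _, wh := range webhooks {
		fmt.Printf("%s -> failurePolicy=%s\n", wh.Name, *wh.FailurePolicy)
	}
}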
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.17
How reproducible:
100%
Steps to Reproduce:
1. Spawn around 50 VMs on a node.
2. Start the drain.
3. Look at the virt-api logs. They will start showing the error "Waited for ...s due to client-side throttling, not priority and fairness", and the waiting time will keep increasing.
4. Look at the kube-apiserver logs and observe webhook timeout errors.
Actual results:
Node drain with around 50 VMs is blocked forever because of virt-api client limits
Expected results:
I think the virt-api client rate-limit defaults are too low and cannot handle a drain of a node with many VMs. Also, the repeated webhook calls for eviction requests every 5 seconds look inefficient.
Additional info:
- clones: CNV-69853 [v4.20.z] Node drain with around 50 VMs is blocked forever because of virt-api client limits (POST)
- is cloned by: CNV-69855 [v4.18.z] Node drain with around 50 VMs is blocked forever because of virt-api client limits (POST)
- split from: CNV-69499 Increase default client rate limits for all virt components (New)