Bug
Resolution: Unresolved
Normal
CNV v4.14.4
Medium
Description of problem:
When live migrating VMs in batches of 100 (VMs 1-100, 101-200, and so on up to 5901-6000), with each new batch started only after the previous one completed, approximately 16% of the VirtualMachineInstanceMigration (VMIM) objects failed to be created. See the attached text document and screenshots for the list of missing VMIMs and the CPU/memory usage of all CNV components during the migration period. The failed oc create calls returned:
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "migration-create-validator.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/migration-validate-create?timeout=10s": context deadline exceeded
Version-Release number of selected component (if applicable):
NAMESPACE       NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
openshift-cnv   kubevirt-hyperconverged-operator.v4.14.4   OpenShift Virtualization   4.14.4    kubevirt-hyperconverged-operator.v4.14.3   Succeeded

oc version
Client Version: 4.14.13
Kustomize Version: v5.0.1
Server Version: 4.14.13
Kubernetes Version: v1.27.10+28ed2d7
How reproducible:
100%
Steps to Reproduce:
1. Create 4000 to 6000 VMs.
2. Live migrate them in batches of 100.
3. The issue usually starts to appear after around 3000 VMIs have been migrated.
Actual results:
Approximately 16% of the VMIM objects were not created because the call to the migration-create-validator.kubevirt.io webhook timed out.
Expected results:
All VMIM objects should be created and the VMIs should be migrated.
Additional info:
Virt-API QPS/burst
handlerConfiguration:
  restClient:
    rateLimiter:
      tokenBucketRateLimiter:
        burst: 400
        qps: 200
Migration config
migrations:
  allowAutoConverge: false
  allowPostCopy: false
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 25
  parallelOutboundMigrationsPerNode: 5
  progressTimeout: 150
Batch migration script:
batch_migrate_vmi() {
    local vm_num="$1"
    local i
    for ((i="$vm_num"; i<vm_num+migration_batch_num; i++)); do
        sed "s/placeholder/$i/g" "$vmi_migration_template" | oc create -f - &
    done
    wait
}
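The batch sequence described above (1-100, 101-200, up to 5901-6000) can be sketched as follows. This is only an illustration of how the start index of each batch might be generated; batch_starts is a hypothetical helper that is not part of the original script, and the wait-for-completion step between batches is deliberately left as a comment because the report does not show how it was done.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the original script): print the first VM
# index of every batch, one per line, for VMs numbered 1..total.
batch_starts() {
    local total="$1" batch_size="$2" start
    for ((start = 1; start <= total; start += batch_size)); do
        echo "$start"
    done
}

# Assumed usage with the reported numbers (6000 VMs, 100 per batch):
#   migration_batch_num=100
#   for start in $(batch_starts 6000 "$migration_batch_num"); do
#       batch_migrate_vmi "$start"
#       # wait here until every migration in this batch has completed
#   done
```

With 6000 VMs and a batch size of 100 this yields 60 batches, the last one starting at VM 5901.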
Some TLS handshake errors and client-side throttling were observed in the virt-api logs:
{"component":"virt-api","level":"info","msg":"http: TLS handshake error from 10.128.0.2:41298: EOF\n","pos":"server.go:3217","timestamp":"2024-04-12T07:58:54.358794Z"}
{"component":"virt-api","level":"info","msg":"http: TLS handshake error from 10.128.0.2:41308: read tcp 10.131.55.14:8443->10.128.0.2:41308: read: connection reset by peer\n","pos":"server.go:3217","timestamp":"2024-04-12T07:58:54.359138Z"}
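Since the creation failures are webhook timeouts rather than validation rejections, one way to check whether they are transient might be to retry the create call on the client side. The sketch below is a possible diagnostic aid under that assumption, not a fix for the underlying issue; retry_with_input is a hypothetical helper, and the attempt count and delay are arbitrary. The manifest is passed as a string so that each attempt gets a fresh copy on stdin.

```shell
#!/usr/bin/env bash

# Hypothetical helper: feed the same input on stdin to a command, retrying
# up to max_attempts times and sleeping delay seconds between failed attempts.
retry_with_input() {
    local max_attempts="$1" delay="$2" input="$3" attempt
    shift 3
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if printf '%s\n' "$input" | "$@"; then
            return 0
        fi
        if ((attempt < max_attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Assumed usage inside the batch loop (3 attempts, 5 seconds apart):
#   manifest=$(sed "s/placeholder/$i/g" "$vmi_migration_template")
#   retry_with_input 3 5 "$manifest" oc create -f - &
```

If the missing VMIMs are created on a later attempt, that would suggest virt-api is temporarily overloaded rather than rejecting the requests.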
Mustgather log (VPN required)
http://storage.scalelab.redhat.com/lguoqing/mustgather/live-migration-webhook/