Some background:
-------------------------
I'm running a scale setup with 84 nodes, using OCS ceph-rbd as backend storage.
I'm attempting to deploy a large number of VMs, but I noticed that some VMs are missing.
This is a real problem for me because it breaks my measurements and prevents VM deployment.
Looking at the virt-controller logs, we can see the following messages:
{"component":"virt-controller","level":"info","msg":"re-enqueuing VirtualMachine default/master-0-win10-vm0075","pos":"vm.go:175","reason":"Internal error occurred: failed calling webhook \"virtualmachine-validator.kubevirt.io\": Post \"https://virt-api.openshift-cnv.svc:443/virtualmachines-validate?timeout=10s\": context deadline exceeded","timestamp":"2021-12-29T10:26:12.657482Z"}
{"component":"virt-controller","kind":"","level":"error","msg":"Updating api version annotations failed","name":"master-0-win10-vm0043","namespace":"default","pos":"vm.go:209","reason":"Internal error occurred: failed calling webhook \"virtualmachine-validator.kubevirt.io\": Post \"https://virt-api.openshift-cnv.svc:443/virtualmachines-validate?timeout=10s\": context deadline exceeded","timestamp":"2021-12-29T10:26:17.599926Z","uid":"ace78d0a-482c-43f8-bb87-c2d16450467b"}
{"component":"virt-controller","level":"info","msg":"re-enqueuing VirtualMachine default/master-0-win10-vm0043","pos":"vm.go:175","reason":"Internal error occurred: failed calling webhook \"virtualmachine-validator.kubevirt.io\": Post \"https://virt-api.openshift-cnv.svc:443/virtualmachines-validate?timeout=10s\": context deadline exceeded","timestamp":"2021-12-29T10:26:17.599989Z"}
{"component":"virt-controller","kind":"","level":"error","msg":"Updating api version annotations failed","name":"master-0-win10-vm0002","namespace":"default","pos":"vm.go:209","reason":"Internal error occurred: failed calling webhook \"virtualmachine-validator.kubevirt.io\": Post \"https://virt-api.openshift-cnv.svc:443/virtualmachines-validate?timeout=10s\": context deadline exceeded","timestamp":"2021-12-29T10:26:18.298490Z","uid":"467ab09a-e3d0-4a09-8814-52d0ea91dd13"}
{"component":"virt-controller","level":"info","msg":"re-enqueuing VirtualMachine default/master-0-win10-vm0002","pos":"vm.go:175","reason":"Internal error occurred: failed calling webhook \"virtualmachine-validator.kubevirt.io\": Post \"https://virt-api.openshift-cnv.svc:443/virtualmachines-validate?timeout=10s\": context deadline exceeded","timestamp":"2021-12-29T10:26:18.298552Z"}
{"component":"virt-controller","kind":"","level":"error","msg":"Updating api version annotations failed","name":"master-0-win10-vm0007","namespace":"default","pos":"vm.go:209","reason":"Internal error occurred: failed calling webhook \"virtualmachine-validator.kubevirt.io\": Post \"https://virt-api.openshift-cnv.svc:443/virtualmachines-validate?timeout=10s\": context deadline exceeded","timestamp":"2021-12-29T10:26:22.670068Z","uid":"6640416a-bdad-450e-b11f-38508a3af158"}
[root@e26-h01-000-r640 ~]# oc logs virt-controller-655db5c9cf-rdqfg|grep "Internal error"|wc -l
148
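
To see which VMs are affected rather than only the total count, the JSON log lines can be grouped per VM. The following is a minimal sketch, not part of the original investigation: the function name, the script name in the usage note, and the shortened sample entries are my own assumptions, and it relies only on the fields visible in the log lines above ("msg", "name", "namespace", "reason").

```python
import json
import sys
from collections import Counter


def summarize_webhook_failures(lines):
    """Count 'failed calling webhook' log entries per namespace/name."""
    per_vm = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip shell prompts, counts and other non-JSON output
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "failed calling webhook" not in entry.get("reason", ""):
            continue
        name = entry.get("name")
        if name:
            # "error" entries carry the VM in separate name/namespace fields
            vm = f'{entry.get("namespace", "")}/{name}'
        else:
            # "info" entries end with "... VirtualMachine <namespace>/<name>"
            vm = entry.get("msg", "").rsplit(" ", 1)[-1]
        per_vm[vm] += 1
    return per_vm


# Shortened, illustrative entries modelled on the log lines above
# (the real "reason" also contains the webhook URL).
REASON = ('Internal error occurred: failed calling webhook '
          '"virtualmachine-validator.kubevirt.io": context deadline exceeded')
SAMPLE = [
    json.dumps({"component": "virt-controller", "level": "error",
                "msg": "Updating api version annotations failed",
                "name": "master-0-win10-vm0043", "namespace": "default",
                "reason": REASON}),
    json.dumps({"component": "virt-controller", "level": "info",
                "msg": "re-enqueuing VirtualMachine default/master-0-win10-vm0043",
                "reason": REASON}),
    json.dumps({"component": "virt-controller", "level": "info",
                "msg": "re-enqueuing VirtualMachine default/master-0-win10-vm0002",
                "reason": REASON}),
    "148",  # non-JSON line, ignored
]

if __name__ == "__main__":
    # Reads log lines from stdin when piped, otherwise uses the sample.
    source = SAMPLE if sys.stdin.isatty() else sys.stdin
    for vm, count in summarize_webhook_failures(source).most_common():
        print(f"{count:5d}  {vm}")
```

Assuming the script is saved as summarize_webhook_failures.py (a hypothetical name), it could be fed from the same command as above: oc logs virt-controller-655db5c9cf-rdqfg | python3 summarize_webhook_failures.py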
Another thing I have to mention is that we never reached the 10s timeout: in most cases we get the "deadline exceeded" error almost immediately after submitting the deployment request (via YAML).
Versions of all relevant components:
===================================
CNV 4.9.1
OCS 4.9.0
LSO 4.9.0-202111151318
OCP 4.9.12
must-gather:
============
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/cnv_must_gather_failed_calling_webhook.tar.gz