- Bug
- Resolution: Done
- Normal
- CNV v4.16.3
- None
- Incidents & Support
- 3
- False
- False
- CNV v4.99.0.rhel9-1527
- Storage Core Sprint 261, Storage Core Sprint 262
- Important
- None
Description of problem:
If a VirtualMachineExport is in the "Skipped" phase (the source VM does not exist), the virt-controller continuously tries to create the export token secret, and every attempt after the first fails with "409 Already Exists". This increases the number of API failures recorded for the virt-controller and triggers the VirtControllerRESTErrorsHigh alert. The status of the OpenShift Virtualization operator also becomes degraded.
VirtualMachineExport in "skipped" state:
# oc get VirtualMachineExport
NAME                           SOURCEKIND       SOURCENAME   PHASE
rhel9-pink-raccoon-24-export   VirtualMachine   test-vm      Skipped
The virt-controller continuously tries to update the status of the VirtualMachineExport resource, roughly every 2-3 seconds:
{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:54.632078Z"}
{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:57.665241Z"}
{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:25:00.685047Z"}
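To confirm the update cadence from the log excerpt above, the gap between consecutive timestamps can be computed directly. This is a minimal sketch that assumes one JSON object per log line with a `timestamp` field in the format shown; the function name `update_intervals` is made up for illustration.

```python
import json
from datetime import datetime

# The three virt-controller log lines quoted above, one JSON object per line.
LOG_LINES = [
    '{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:54.632078Z"}',
    '{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:57.665241Z"}',
    '{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:25:00.685047Z"}',
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def update_intervals(lines):
    """Return the seconds elapsed between consecutive log entries."""
    stamps = [
        datetime.strptime(json.loads(line)["timestamp"], TIMESTAMP_FORMAT)
        for line in lines
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


intervals = update_intervals(LOG_LINES)
print(intervals)  # each gap is roughly three seconds
```

At one update about every three seconds, 20 minutes yields on the order of 400 updates per export object, so the failure count scales quickly with the number of skipped exports.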
Each update triggers an attempt to create the secret. In around 20 minutes, I see around 700 failed requests:
grep "export-token-rhel9-pink-raccoon-24-export" audit.log | grep "Failure" | wc -l
698
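The grep above counts every matching line regardless of status code. A slightly more structured count, broken down by HTTP code, might look like the sketch below. It assumes the audit log holds one JSON event per line with the standard `responseStatus.status` and `responseStatus.code` fields; the helper name `count_failures` and the sample events are hypothetical and not taken from the affected cluster.

```python
import json
from collections import Counter

SECRET_NAME = "export-token-rhel9-pink-raccoon-24-export"


def count_failures(lines, secret_name):
    """Count failed audit events that mention secret_name, keyed by HTTP code."""
    counts = Counter()
    for line in lines:
        if secret_name not in line:
            continue  # same coarse filter as the first grep
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        status = event.get("responseStatus") or {}
        if status.get("status") == "Failure":
            counts[status.get("code")] += 1
    return counts


# Hypothetical sample events, for illustration only.
SAMPLE = [
    json.dumps({"verb": "create",
                "objectRef": {"resource": "secrets", "name": SECRET_NAME},
                "responseStatus": {"status": "Failure", "code": 409,
                                   "reason": "AlreadyExists"}}),
    json.dumps({"verb": "create",
                "objectRef": {"resource": "secrets", "name": SECRET_NAME},
                "responseStatus": {"status": "Failure", "code": 409,
                                   "reason": "AlreadyExists"}}),
    json.dumps({"verb": "create",
                "objectRef": {"resource": "secrets", "name": SECRET_NAME},
                "responseStatus": {"code": 201}}),
    json.dumps({"verb": "create",
                "objectRef": {"resource": "secrets", "name": "unrelated"},
                "responseStatus": {"status": "Failure", "code": 409}}),
]

failures = count_failures(SAMPLE, SECRET_NAME)
print(dict(failures))
```

To run it against a real file, pass the open file object instead of `SAMPLE`, for example `count_failures(open("audit.log"), SECRET_NAME)`.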
Also, please refer to the screenshot of the metric.
This would easily trigger the VirtControllerRESTErrorsHigh alert.
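The failure pattern described above can be reproduced in miniature. The sketch below is not the virt-controller code and not necessarily how the issue was fixed. It uses a hypothetical in-memory `FakeSecretAPI` to show why an unconditional create on every reconcile produces one failed request per loop, and how checking for the secret first (in a real controller, typically against a local cache) avoids it.

```python
class AlreadyExists(Exception):
    """Stands in for an HTTP 409 response from the API server."""


class FakeSecretAPI:
    """Hypothetical in-memory stand-in for the secrets endpoint."""

    def __init__(self):
        self.secrets = {}
        self.failed_requests = 0  # mimics the REST error counter

    def create(self, name, data):
        if name in self.secrets:
            self.failed_requests += 1
            raise AlreadyExists(name)
        self.secrets[name] = data

    def exists(self, name):
        return name in self.secrets


def naive_reconcile(api, name):
    """Always attempt the create; the error is swallowed but still counted."""
    try:
        api.create(name, {"token": "placeholder"})
    except AlreadyExists:
        pass


def guarded_reconcile(api, name):
    """Only create the secret when it is not already present."""
    if not api.exists(name):
        api.create(name, {"token": "placeholder"})


NAME = "export-token-rhel9-pink-raccoon-24-export"

naive_api = FakeSecretAPI()
guarded_api = FakeSecretAPI()
for _ in range(700):  # roughly the number of requests seen in 20 minutes
    naive_reconcile(naive_api, NAME)
    guarded_reconcile(guarded_api, NAME)

print(naive_api.failed_requests, guarded_api.failed_requests)
```

Swallowing the 409 in the caller does not help with the alert, because the failed request has already been counted by the time the error is handled.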
Version-Release number of selected component (if applicable):
OpenShift Virtualization 4.16.3
How reproducible:
100%
Steps to Reproduce:
Create a VirtualMachineExport that references a nonexistent VM, so that its phase moves to "Skipped":
apiVersion: export.kubevirt.io/v1alpha1
kind: VirtualMachineExport
metadata:
  name: rhel9-pink-raccoon-24-export
spec:
  source:
    apiGroup: "kubevirt.io"
    kind: VirtualMachine
    name: rhel9-pink-raccoon-24-invalid
After around 20 minutes, go to Observe => Metrics and run the following query:
rest_client_requests_total{namespace="openshift-cnv",pod=~"virt-controller-.*",code=~"(4|5)[0-9][0-9]"}
There will be a huge number of 409s for the secrets resource. Soon afterwards, the VirtControllerRESTErrorsHigh alert will also fire.
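If the query result is fetched through the Prometheus HTTP API rather than the console, the error counters can be summed per status code as sketched below. The response layout follows the standard instant-query format; the helper name `errors_by_code`, the pod names, and the sample values are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict


def errors_by_code(response):
    """Sum the sample values of an instant-query result per `code` label."""
    totals = defaultdict(float)
    for series in response["data"]["result"]:
        code = series["metric"].get("code", "unknown")
        _timestamp, value = series["value"]  # value arrives as a string
        totals[code] += float(value)
    return dict(totals)


# Hypothetical response for the rest_client_requests_total query above.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"code": "409", "pod": "virt-controller-a"},
             "value": [1727976300, "698"]},
            {"metric": {"code": "404", "pod": "virt-controller-a"},
             "value": [1727976300, "3"]},
            {"metric": {"code": "409", "pod": "virt-controller-b"},
             "value": [1727976300, "2"]},
        ],
    },
}

totals = errors_by_code(SAMPLE_RESPONSE)
print(totals)
```

A 409 total that dwarfs every other code, as in this made-up sample, is the signature to look for when reproducing the issue.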
Actual results:
A VirtualMachineExport in the "Skipped" phase triggers the VirtControllerRESTErrorsHigh alert.
Expected results:
Additional info:
- links to RHEA-2024:139653 OpenShift Virtualization 4.18.0 Images