OpenShift Virtualization / CNV-49427

VirtualMachineExport may trigger alert VirtControllerRESTErrorsHigh

    • Bug
    • Resolution: Done
    • Normal
    • CNV v4.18.0
    • CNV v4.16.3
    • Storage Platform
    • None
    • Incidents & Support
    • 3
    • False
    • False
    • CNV v4.99.0.rhel9-1527
    • Storage Core Sprint 261, Storage Core Sprint 262
    • Important
    • None

      Description of problem:

      If a VirtualMachineExport is in the "Skipped" phase (the source VM does not exist), the virt-controller continuously tries to create the export token secret, which fails with "409 Already Exists". This increases the number of failed API requests from the virt-controller and triggers the VirtControllerRESTErrorsHigh alert. The status of the OpenShift Virtualization operator also becomes degraded.
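
The loop described above can be modeled with a toy in-memory API server. This is a minimal sketch, not KubeVirt code: `FakeSecretAPI`, `reconcile_naive`, `reconcile_idempotent`, and `failures_after` are hypothetical names, and the look-up-before-create variant is only one common way to avoid repeated 409s, not necessarily how this issue was fixed.

```python
class AlreadyExists(Exception):
    """Stands in for an HTTP 409 response from the API server."""


class FakeSecretAPI:
    """Hypothetical in-memory stand-in for the secrets endpoint."""

    def __init__(self):
        self.secrets = set()
        self.failed_requests = 0  # what a REST error metric would count

    def get(self, name):
        return name in self.secrets

    def create(self, name):
        if name in self.secrets:
            self.failed_requests += 1
            raise AlreadyExists(name)
        self.secrets.add(name)


def reconcile_naive(api, secret_name):
    """Always try to create; every pass after the first yields a 409."""
    try:
        api.create(secret_name)
    except AlreadyExists:
        pass  # swallowed here, but already counted as a failed request


def reconcile_idempotent(api, secret_name):
    """Look the secret up first and create it only when it is missing."""
    if not api.get(secret_name):
        api.create(secret_name)


def failures_after(reconcile, passes):
    """Run the given reconcile function repeatedly and return the 409 count."""
    api = FakeSecretAPI()
    for _ in range(passes):
        reconcile(api, "export-token-example")
    return api.failed_requests


if __name__ == "__main__":
    print(failures_after(reconcile_naive, 700))
    print(failures_after(reconcile_idempotent, 700))
```

With 700 passes, the naive variant records 699 failed requests (only the first create succeeds) and the idempotent variant records none.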

      VirtualMachineExport in the "Skipped" phase:

      # oc get VirtualMachineExport
      NAME                           SOURCEKIND       SOURCENAME   PHASE
      rhel9-pink-raccoon-24-export   VirtualMachine   test-vm      Skipped

       

      The virt-controller continuously tries to update the status of the VirtualMachineExport resource, roughly every 2-3 seconds:

      {"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:54.632078Z"}
      
      {"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:57.665241Z"}
      
      {"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:25:00.685047Z"}
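
To check the update interval without reading timestamps by eye, the gaps between consecutive log lines can be computed. This is a small sketch that assumes the JSON log format shown above; `update_intervals` is a hypothetical helper name.

```python
import json
from datetime import datetime

# The three virt-controller log lines quoted above.
LOG_LINES = [
    '{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:54.632078Z"}',
    '{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:24:57.665241Z"}',
    '{"component":"virt-controller","level":"info","msg":"Updating VirtualMachineExport openshift-cnv/rhel9-pink-raccoon-24-export","pos":"export.go:492","timestamp":"2024-10-03T17:25:00.685047Z"}',
]


def update_intervals(lines):
    """Return the gaps in seconds between consecutive log timestamps."""
    stamps = [
        datetime.strptime(json.loads(line)["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        for line in lines
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in update_intervals(LOG_LINES):
        print(f"{gap:.3f}s")
```

For the three lines above, the gaps are about 3.03 and 3.02 seconds.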

      Each update triggers an attempt to create the secret. In around 20 minutes, I see around 700 failed requests:

      grep "export-token-rhel9-pink-raccoon-24-export" audit.log |grep "Failure"| wc -l
      698
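
The same count can be broken down by HTTP response code, which confirms that the failures are 409s rather than some other error. This is a hedged sketch: it assumes the audit log is in JSON-lines form and that each event carries `responseStatus.code`, as in the Kubernetes audit event format. Like the grep above, it matches the secret name anywhere in the raw line. `tally_response_codes` is a hypothetical helper name.

```python
import json
import sys
from collections import Counter


def tally_response_codes(lines, needle):
    """Count response codes of audit events whose raw line mentions needle."""
    codes = Counter()
    for line in lines:
        if needle not in line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        status = event.get("responseStatus") or {}
        code = status.get("code")
        if code is not None:
            codes[code] += 1
    return codes


# Usage: python tally.py audit.log export-token-rhel9-pink-raccoon-24-export
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fh:
        for code, count in sorted(tally_response_codes(fh, sys.argv[2]).items()):
            print(code, count)
```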

      Also, please refer to the attached screenshot of the metric.

      This can easily trigger the VirtControllerRESTErrorsHigh alert.

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization   4.16.3

      How reproducible:

      100%

      Steps to Reproduce:

      Create a VirtualMachineExport that references a nonexistent VM, so that its phase moves to "Skipped":

      apiVersion: export.kubevirt.io/v1alpha1
      kind: VirtualMachineExport
      metadata:
        name: rhel9-pink-raccoon-24-export
      spec:
        source:
          apiGroup: "kubevirt.io"
          kind: VirtualMachine
          name: rhel9-pink-raccoon-24-invalid

      After around 20 minutes, go to Observe > Metrics and run the following query:

      rest_client_requests_total{namespace="openshift-cnv",pod=~"virt-controller-.*",code=~"(4|5)[0-9][0-9]"}

      There will be a huge number of 409 responses for the secrets resource. Soon afterwards, the VirtControllerRESTErrorsHigh alert will also fire.

      Actual results:

      A VirtualMachineExport in the "Skipped" phase triggers the VirtControllerRESTErrorsHigh alert.

      Expected results:

       

      Additional info:

       

              rhn-support-awels Alexander Wels
              rhn-support-nashok Nijin Ashok
              Yan Du Yan Du