OpenShift Virtualization / CNV-40786

webhook failure during live migration


    • Bug
    • Resolution: Unresolved
    • Normal
    • CNV v4.17.0
    • CNV v4.14.4
    • CNV Virtualization
    • Medium

      Description of problem:

      When live migrating VMs in batches of 100 (VMs 1-100, then 101-200, and so on up to 5901-6000), with each batch started only after the previous one completes, approximately 16% of the VirtualMachineInstanceMigration (VMIM) objects fail to be created. See the attached text document and screenshots for a list of missing VMIMs and the CPU/memory usage of all CNV components during the migration period. The failed creations return the following error:

      Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "migration-create-validator.kubevirt.io": failed to call webhook: Post "https://virt-api.openshift-cnv.svc:443/migration-validate-create?timeout=10s": context deadline exceeded

      Version-Release number of selected component (if applicable):

      NAMESPACE                              NAME                                          DISPLAY                       VERSION               REPLACES                                      PHASE
      openshift-cnv                          kubevirt-hyperconverged-operator.v4.14.4      OpenShift Virtualization      4.14.4                kubevirt-hyperconverged-operator.v4.14.3      Succeeded
      oc version
      Client Version: 4.14.13
      Kustomize Version: v5.0.1
      Server Version: 4.14.13
      Kubernetes Version: v1.27.10+28ed2d7

      How reproducible:

      100% 

      Steps to Reproduce:

      1. Create 4000 to 6000 VMs.
      2. Live migrate them in batches of 100.
      3. The issue usually starts to appear after around 3000 VMIs have been migrated.
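
The batch boundaries implied by these steps (1-100, 101-200, and so on) can be sketched as follows. This is only an illustration: `batch_ranges` and its two arguments are hypothetical names, not part of the reporter's tooling, and the function only prints index ranges without touching the cluster.

```shell
# Hypothetical helper: print the first and last VM index of each batch.
# $1 = total number of VMs, $2 = batch size.
batch_ranges() {
   local total="$1"
   local size="$2"
   local start=1
   local end
   while [ "$start" -le "$total" ]; do
      end=$((start + size - 1))
      # Clamp the final batch to the total VM count.
      if [ "$end" -gt "$total" ]; then
         end="$total"
      fi
      echo "${start}-${end}"
      start=$((start + size))
   done
}
```

For example, `batch_ranges 6000 100` prints 60 ranges, ending with `5901-6000`.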

      Actual results:

      Approximately 16% of the VMIM objects were not created because the call to the validating webhook timed out.

      Expected results:

      All VMIM objects should be created and the VMIs should be migrated.

      Additional info:
      Virt-API QPS/burst:

          handlerConfiguration:
            restClient:
              rateLimiter:
                tokenBucketRateLimiter:
                  burst: 400
                  qps: 200

      Migration config:

      migrations:
        allowAutoConverge: false
        allowPostCopy: false
        completionTimeoutPerGiB: 800
        parallelMigrationsPerCluster: 25
        parallelOutboundMigrationsPerNode: 5
        progressTimeout: 150

      Batch migration script:

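      # Create one VMIM per VM in the batch that starts at index $1.
      # Relies on two globals: migration_batch_num (batch size) and
      # vmi_migration_template (manifest file containing "placeholder").
      batch_migrate_vmi() {
         local vm_num="$1"
         local i
         for ((i = vm_num; i < vm_num + migration_batch_num; i++)); do
            # Every creation in the batch runs concurrently in the background.
            sed "s/placeholder/$i/g" "$vmi_migration_template" | oc create -f - &
         done
         # Block until every background "oc create" has exited.
         wait
      }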
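
Because the script above backgrounds every `oc create` and then calls a bare `wait`, the exit status of individual creations is discarded. A minimal sketch of a variant that records which indices failed is shown below. It assumes the same two globals as the original script; the function name `batch_migrate_vmi_logged` and the `failed_log` file are hypothetical, and the loop is written in portable sh rather than the original bash arithmetic form.

```shell
# Sketch: same concurrent fan-out as batch_migrate_vmi, but append the
# index of every failed creation to a log file.
# Assumes the globals migration_batch_num and vmi_migration_template.
# $1 = first VM index, $2 = hypothetical log file for failed indices.
batch_migrate_vmi_logged() {
   local vm_num="$1"
   local failed_log="${2:-failed_vmims.txt}"
   local i="$vm_num"
   local end=$((vm_num + migration_batch_num))
   while [ "$i" -lt "$end" ]; do
      (
         # Without pipefail, the pipeline status is that of "oc create".
         sed "s/placeholder/$i/g" "$vmi_migration_template" | oc create -f - \
            || echo "$i" >> "$failed_log"
      ) &
      i=$((i + 1))
   done
   # Block until every background subshell has exited.
   wait
}
```

After a run, the log file lists the VM indices whose VMIM was rejected, which can be compared against the attached list of missing VMIMs.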


      TLS handshake errors and client-side throttling were observed in the virt-api logs, for example:

      {"component":"virt-api","level":"info","msg":"http: TLS handshake error from 10.128.0.2:41298: EOF\n","pos":"server.go:3217","timestamp":"2024-04-12T07:58:54.358794Z"}
      {"component":"virt-api","level":"info","msg":"http: TLS handshake error from 10.128.0.2:41308: read tcp 10.131.55.14:8443->10.128.0.2:41308: read: connection reset by peer\n","pos":"server.go:3217","timestamp":"2024-04-12T07:58:54.359138Z"}
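
To gauge how often messages like these occur, the log lines can be counted by substring. The helper below is a hypothetical sketch (`count_matches` is not an existing tool); it reads log lines on stdin and counts those containing a fixed string.

```shell
# Hypothetical helper: count stdin lines that contain the fixed string $1.
count_matches() {
   # grep -c prints 0 but exits non-zero when nothing matches,
   # so "|| true" keeps the function's exit status at 0.
   grep -F -c -- "$1" || true
}
```

A possible invocation, assuming the logs are read from the virt-api deployment in the `openshift-cnv` namespace, is `oc logs -n openshift-cnv deployment/virt-api | count_matches 'TLS handshake error'`. The same helper could be pointed at throttling messages, although their exact wording is not shown in this report.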
      

      Must-gather logs (VPN required):

      http://storage.scalelab.redhat.com/lguoqing/mustgather/live-migration-webhook/

            sgott@redhat.com Stuart Gott
            rh-ee-lguoqing Li Guoqing
            Kedar Bidarkar