OpenShift Virtualization / CNV-30386

[2218435] Queueing multiple VM migrations causes virt-controller to hit a deadlock.


    • CNV Virtualization Sprint 239, CNV Virtualization Sprint 240, CNV Virtualization Sprint 241, CNV Virtualization Sprint 243, CNV Virtualization Sprint 244, CNV Virtualization Sprint 245, CNV Virtualization Sprint 246, CNV Virtualization Sprint 247, CNV Virtualization Sprint 248, CNV Virtualization Sprint 252
    • High
    • No

      I'm running a scale regression setup on:
      =========================================
      OpenShift 4.13.2
      OpenShift Virtualization 4.13.1
      OpenShift Container Storage 4.12.4-rhodf
      This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS (Red Hat Ceph Storage) cluster.

      While I was testing idle VM migration in bulks - meaning I schedule 100 VM migrations, wait for completion, and then schedule another 100 - I noticed that
      the migration completion rate was slowly degrading with every bulk, starting at 20 seconds per VM and reaching up to 1570 seconds per VM in the last bulks.
      In order to debug this issue I scheduled 800 VM migrations at once so it would be easier to spot the root cause.
      Ideally, the expected result is that all of those migration jobs are queued and then executed at a rate of parallelMigrationsPerCluster.
      However, what actually happened is that all of those items got stuck in the virt-controller migration queue.
      They remained there indefinitely while consuming memory and CPU; even after the VMIMs had already failed, the queue remained unfazed. In fact, the only thing that caused a few of those queue items to be eliminated was when nonvoluntary_ctxt_switches were triggered. I eventually killed the active virt-controller after 4.5 hours - see the attached image virt-controller-queue.
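
      For context, parallelMigrationsPerCluster is the cluster-wide live-migration concurrency limit. A minimal sketch of tuning it through the HyperConverged CR, assuming the default kubevirt-hyperconverged name and openshift-cnv namespace (the value 10 is only an example):

      # Raise the cluster-wide live-migration concurrency limit.
      # CR name, namespace and the value are assumptions for illustration.
      oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge \
        -p '{"spec":{"liveMigrationConfig":{"parallelMigrationsPerCluster":10}}}'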

      The way I found to avoid triggering this issue is by making sure, through automation, that the number of scheduled migrating VMs is always <= parallelMigrationsPerCluster.
      By doing that I was able to complete 1200 VM migrations in just over 12 minutes.
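
      A minimal sketch of that throttling logic, assuming the VMs live in a single namespace ("my-vms" is a placeholder), that oc, virtctl and jq are available, and that LIMIT matches the cluster's parallelMigrationsPerCluster:

      #!/bin/bash
      # Schedule a new migration only while the number of in-flight VMIMs is
      # below the cluster's parallelMigrationsPerCluster (5 here as an example).
      LIMIT=5
      NS=my-vms   # placeholder namespace
      for vm in $(oc get vms -n "$NS" -o name | cut -d/ -f2); do
        while true; do
          inflight=$(oc get vmim -n "$NS" -o json \
            | jq '[.items[] | select(.status.phase != "Succeeded" and .status.phase != "Failed")] | length')
          [ "$inflight" -lt "$LIMIT" ] && break
          sleep 5
        done
        virtctl migrate "$vm" -n "$NS"
      done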

      It's important to note that this issue is exclusive to the migration flow; for example, when I mass-scheduled 6000 VMs for starting, I didn't experience any issues.

      Note that I was using the following debug configuration, but the rate at which those logs were generated and overwritten made them useless at this scale.
      ============================================================================================================================================

      spec:
        logVerbosityConfig:
          kubevirt:
            virtController: 9
            virtHandler: 9
            virtLauncher: 9
            virtAPI: 9
      ============================================================================================================================================
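
      To keep those logs from being lost to rotation, one option is to stream them to files for the duration of the test. A minimal sketch, assuming the default openshift-cnv namespace and the kubevirt.io=virt-controller pod label:

      # Stream logs from every virt-controller pod into per-pod files so they
      # survive log rotation (namespace and label selector are assumptions).
      for pod in $(oc get pods -n openshift-cnv -l kubevirt.io=virt-controller -o name); do
        oc logs -f -n openshift-cnv "$pod" > "$(basename "$pod").log" &
      done
      wait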

      Steps to reproduce:
      This issue is 100% reproducible.
      1. Create a cluster with 800 VMs.
      2. Initiate a large number of migrations (as easy as running a bunch of "virtctl migrate" commands; see the sketch below).
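
      A minimal sketch of step 2, assuming the 800 VMs live in a single namespace ("my-vms" is a placeholder) and that virtctl is on the PATH:

      # Fire a migration for every VM in the namespace as fast as possible,
      # without waiting for in-flight migrations to finish.
      for vm in $(oc get vms -n my-vms -o name | cut -d/ -f2); do
        virtctl migrate "$vm" -n my-vms
      done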

              jelejosne Jed Lejosne
              bbenshab Boaz Ben Shabat
              Guy Chen