OpenShift Virtualization / CNV-29308

[2210070] During node drain, non-migratable VMs may block other VMs from being migrated for a long time


    • CNV Virtualization Sprint 238, CNV Virtualization Sprint 239, CNV Virtualization Sprint 240
    • Important

      Description of problem:
      By default only 2 outbound migrations are allowed in parallel per node. If there are many non-migratable VMs on the cluster, they may block migratable VMs from being migrated while the node is being drained.
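
      For reference, a minimal sketch of where this limit can be inspected on an OpenShift Virtualization cluster; the HyperConverged name and namespace below are the usual defaults (kubevirt-hyperconverged / openshift-cnv) and may differ on other installs:

      > $ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
      >     -o jsonpath='{.spec.liveMigrationConfig}{"\n"}'
      > # The limits that apply are spec.configuration.migrations.parallelOutboundMigrationsPerNode
      > # (default 2) and parallelMigrationsPerCluster (default 5) in the KubeVirt CR:
      > $ oc get kubevirt -n openshift-cnv \
      >     -o jsonpath='{.items[0].spec.configuration.migrations}{"\n"}'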

      We have a fixed bug 2124528 for a similar scenario: a backoff was added to failed migrations, and it helps with manually started migrations (a VMIM in the Pending state does not affect a manually started migration). But the fix does not work well for automatically created migrations (i.e. kubevirt-evacuation and kubevirt-workload-update).
      For example, I have multiple VMs running on one node:

      > $ oc get vmi
      > NAME                AGE   PHASE     IP             NODENAME                            READY
      > vm-a                13h   Running   10.129.2.144   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-b                13h   Running   10.129.2.127   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-1   13h   Running   10.129.2.115   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-2   13h   Running   10.129.2.114   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-3   13h   Running   10.129.2.116   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-4   13h   Running   10.129.2.117   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-5   13h   Running   10.129.2.119   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-6   13h   Running   10.129.2.118   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-y                13h   Running   10.129.2.142   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-z                13h   Running   10.129.2.143   virt-den-413-zj5v7-worker-0-mr7cc   True

      vm-a (and vm-b, vm-y, vm-z) - migratable
      vm-fedora-node1-1 (2, 3, ...) - non-migratable because of a node selector
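
      For illustration only (the hostname label and the patch target below are assumptions, not taken from this cluster): a node selector of this kind pins a VM to the drained node, so the evacuation target pod can never be scheduled elsewhere and every migration attempt fails and accrues backoff:

      > $ oc patch vm vm-fedora-node1-1 --type merge \
      >     -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"virt-den-413-zj5v7-worker-0-mr7cc"}}}}}'
      > # (the VM has to be restarted for the selector to reach the running VMI)
      > # The selector can then be checked on the VMI:
      > $ oc get vmi vm-fedora-node1-1 -o jsonpath='{.spec.nodeSelector}{"\n"}'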

      After draining the node, if both VMIs selected for migration have a migration backoff, their VMIMs get stuck in the Pending state and do not allow other VMIs to create migrations:

      > $ oc get vmim | egrep -v "Failed|Succeed"
      > NAME                        PHASE     VMI
      > kubevirt-evacuation-4s6td   Pending   vm-fedora-node1-3
      > kubevirt-evacuation-77fr7   Pending   vm-fedora-node1-1

      ^^ These VMIMs are in the Pending state because of the backoff; they do nothing, just wait for the backoff timeout to expire. No other VMIs attempt to migrate.
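
      A sketch of how to confirm the two Pending migrations are only waiting on the backoff (the exact event wording depends on the KubeVirt build):

      > $ oc describe vmim kubevirt-evacuation-4s6td
      > $ oc get events --field-selector involvedObject.name=vm-fedora-node1-3 \
      >     --sort-by=.lastTimestamp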

      But if I start a migration manually, it is successfully created:
      > $ oc get vmim | egrep -v "Failed|Succeed"
      > NAME                        PHASE     VMI
      > kubevirt-evacuation-4s6td   Pending   vm-fedora-node1-3
      > kubevirt-evacuation-77fr7   Pending   vm-fedora-node1-1
      > kubevirt-evacuation-ftd25   Running   vm-z
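
      For completeness, one possible way to start such a manual migration (a sketch; either form should work, and the generateName below is just an arbitrary choice):

      > $ virtctl migrate vm-z
      > # or create the migration object directly:
      > $ cat <<EOF | oc create -f -
      > apiVersion: kubevirt.io/v1
      > kind: VirtualMachineInstanceMigration
      > metadata:
      >   generateName: manual-migration-vm-z-
      > spec:
      >   vmiName: vm-z
      > EOF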

      Version-Release number of selected component (if applicable):
      4.13

      How reproducible:
      Most of the time the migratable VMs are migrated in a reasonable amount of time, but in rare cases I've observed it taking up to 1 hour for a VM to be migrated.

      Steps to Reproduce:
      1. Run multiple VMs (migratable and non-migratable) on one node
      2. Drain the node (an example drain command is shown below)
      3.
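
      A sketch of the drain step (the flags are the usual ones for a node running VMs; adjust to the environment):

      > $ oc adm drain virt-den-413-zj5v7-worker-0-mr7cc \
      >     --ignore-daemonsets --delete-emptydir-data --force
      > # watch the evacuation migrations the drain triggers:
      > $ oc get vmim -w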

      Actual results:
      Migrations in the Pending state (waiting for the backoff timeout to expire) do not allow other VMs to be migrated because of the parallel migration limit.

      Expected results:
      Maybe we should consider Pending migrations as non-active, so that they do not count toward the total number of parallel migrations during the node drain?
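
      As a rough illustration of that distinction, this counts only the migrations that are actually making progress, i.e. everything except Pending, Succeeded and Failed (requires jq):

      > $ oc get vmim -o json | jq '[.items[]
      >     | select(.status.phase != "Pending"
      >              and .status.phase != "Succeeded"
      >              and .status.phase != "Failed")] | length'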

      Additional info:

      Assignee: Igor Bezukh (ibezukh)
      Reporter: Denys Shchedrivyi (dshchedr@redhat.com)