OpenShift Virtualization / CNV-20450

[2118147] Migrating VMs get scheduled to cordoned nodes and fail the migration testing



      I'm running a scale regression setup on:
      =========================================
      OCP 4.10.20
      OpenShift Virtualization 4.10.3

      This is a large-scale setup with 117 nodes and 5000 persistent VMs. Before I start my migration testing, I make sure the same number of VMs goes through migration; I do that by dividing the nodes into zones:
      ====================================================
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      zone-0   rendered-zone-0-8bb922c85674a178fad84d87b091f66f   True      False      False      38             38                  38                    0                      4d
      zone-1   rendered-zone-1-0647fa88f407dd22b944c5e27243e2c1   True      False      False      38             38                  38                    0                      4d
      zone-2   rendered-zone-2-0647fa88f407dd22b944c5e27243e2c1   False     True       False      39             39                  39                    0                      4d
      ====================================================

      I cordon/uncordon according to the current scheduling map, meaning that if a zone has more VMs than I require, I cordon all the nodes in that zone and migrate the excess VMs using "virtctl migrate", so that they get migrated to one of the other two zones.
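
The cordon-then-migrate preparation described above can be sketched roughly as follows. This is a sketch only: the zone node selector and the VM names are hypothetical placeholders, not the labels used in the actual setup, and the wait loop relies on the VMI's `status.migrationState.completed` field that KubeVirt sets when a live migration finishes.

```shell
# Sketch of the preparation step: cordon every node in one zone, then
# serially migrate the excess VMs off it. ZONE_SELECTOR and the VM names
# below are assumed placeholders.

ZONE_SELECTOR="custom-kubelet=zone-2"   # hypothetical node label for the zone

drain_zone_serially() {
  # Cordon all nodes in the zone so the scheduler stops placing pods there.
  for node in $(oc get nodes -l "$ZONE_SELECTOR" -o name); do
    oc adm cordon "$node"
  done

  # Migrate one VM at a time, waiting for each live migration to complete
  # before triggering the next one (serial, as in the report).
  for vm in win10-vm0001 win10-vm0002; do
    virtctl migrate "$vm"
    until [ "$(oc get vmi "$vm" -o jsonpath='{.status.migrationState.completed}')" = "true" ]; do
      sleep 10
    done
  done
}

# Only run against a real cluster; otherwise this file is just a sketch.
if command -v oc >/dev/null 2>&1 && command -v virtctl >/dev/null 2>&1; then
  drain_zone_serially
else
  echo "oc/virtctl not found; nothing to do"
fi
```

With all nodes in the source zone cordoned, the scheduler should only be able to place the migration target pods on the other two zones, which is what makes the behavior reported below surprising.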

      In this specific scenario I needed a homogeneous zone with only RHEL VMs, so I tried to migrate all the Windows VMs to the other two zones; in this specific case I had 279 VMs scheduled to zone-2 (worker076-worker114).

      I started the migration on Sun Aug 14 11:31:13 UTC 2022 and gave it some time to complete; however, 38 of the 279 VMs did not migrate, because they were attempting to migrate within the zone itself, to cordoned nodes.
      I ran this scale test before on 4.9.2, with OCP 4.9.15, and I never had this issue, so I believe this is a new issue.
      Note: those migrations were triggered serially, not in parallel, since this was not the actual test but just the preparation for it.

      The following is a list of the VMs that got rescheduled to the cordoned nodes:

      win10-vm0120
      win10-vm0122
      win10-vm0123
      win10-vm0128
      win10-vm0131
      win10-vm0144
      win10-vm0150
      win10-vm0154
      win10-vm0161
      win10-vm0162
      win10-vm0167
      win10-vm0184
      win10-vm0205
      win10-vm0208
      win10-vm0215
      win10-vm0216
      win10-vm0220
      win10-vm0221
      win10-vm0226
      win10-vm0230
      win10-vm0232
      win10-vm0237
      win10-vm0243
      win10-vm0253
      win10-vm0273
      win10-vm0277
      win10-vm0284
      win10-vm0292
      win10-vm0314
      win10-vm0326
      win10-vm0358
      win10-vm0420
      win10-vm0490
      win10-vm0693
      win10-vm0708
      win10-vm0768
      win10-vm0773
      win10-vm0924

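A list like the one above can be regenerated from the cluster by cross-checking each running VMI's node against the set of cordoned nodes. A minimal sketch, assuming cluster-wide read access; cordoned nodes are identified by `spec.unschedulable=true`, and a VMI records its node in `status.nodeName`:

```shell
# Report every VMI currently running on a cordoned node.

report_vmis_on_cordoned_nodes() {
  # Names of all cordoned nodes (spec.unschedulable == true).
  cordoned=$(oc get nodes \
    -o jsonpath='{range .items[?(@.spec.unschedulable==true)]}{.metadata.name}{"\n"}{end}')

  # "name node" pairs for every VMI, filtered against the cordoned set.
  oc get vmi --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeName}{"\n"}{end}' |
  while read -r vmi node; do
    if printf '%s\n' "$cordoned" | grep -qx "$node"; then
      echo "$vmi -> $node (cordoned)"
    fi
  done
}

if command -v oc >/dev/null 2>&1; then
  report_vmis_on_cordoned_nodes
else
  echo "oc not found; sketch only"
fi
```
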
      must-gather:
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather-migrating-to-cordon-node.tar.gz

      VMs list:
      =========
      rhel82-vm0001 - rhel82-vm4000
      win10-vm0001 - win10-vm1000

              Dominik Holler (dholler@redhat.com)
              Boaz Ben Shabat (bbenshab@redhat.com)
              Guy Chen
              Votes: 3
              Watchers: 8
