Bug
Resolution: Unresolved
None
Quality / Stability / Reliability
0.42
False
False
NEW
Important
No
I'm running a scale regression setup on:
=========================================
OCP 4.10.20
OpenShift Virtualization 4.10.3
This is a large-scale setup with 117 nodes and 5000 persistent VMs. Before I start my migration testing, I make sure that the same number of VMs goes through migration. I do that by dividing the nodes into zones:
====================================================
NAME     CONFIG                                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
zone-0   rendered-zone-0-8bb922c85674a178fad84d87b091f66f   True      False      False      38             38                  38                    0                      4d
zone-1   rendered-zone-1-0647fa88f407dd22b944c5e27243e2c1   True      False      False      38             38                  38                    0                      4d
zone-2   rendered-zone-2-0647fa88f407dd22b944c5e27243e2c1   False     True       False      39             39                  39                    0                      4d
====================================================
I then cordon/uncordon nodes according to the current scheduling map. That is, if a zone has more VMs than I require, I cordon all the nodes in that zone and migrate the excess VMs using "virtctl migrate", so that they get migrated to one of the other 2 zones.
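The cordon-then-migrate preparation described above can be sketched as a small dry-run planner. This is only an illustration, not the script used in this setup: the helper names `zone_nodes` and `plan_zone_evacuation` and the `default` namespace are assumptions, and the worker name pattern is taken from the worker076-worker114 range mentioned in this report. It only builds the `oc adm cordon` and `virtctl migrate` command lines and does not run them.

```python
from typing import Iterable, List


def zone_nodes(first: int, last: int, prefix: str = "worker") -> List[str]:
    """Return zero-padded node names, e.g. worker076..worker114 (inclusive)."""
    return [f"{prefix}{i:03d}" for i in range(first, last + 1)]


def plan_zone_evacuation(nodes: Iterable[str], vms: Iterable[str],
                         namespace: str = "default") -> List[List[str]]:
    """Build the command lines for evacuating a zone.

    All nodes are cordoned first, then the VMs are migrated one at a time
    (serially), mirroring the preparation step described in this report.
    The namespace is a placeholder assumption.
    """
    commands = [["oc", "adm", "cordon", node] for node in nodes]
    commands += [["virtctl", "migrate", vm, "-n", namespace] for vm in vms]
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in plan_zone_evacuation(zone_nodes(76, 114), ["win10-vm0120"]):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to confirm that every node in the zone is cordoned before the first migration is requested.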
In this specific scenario, I needed a homogeneous zone with only RHEL VMs, so I tried to migrate all the Windows VMs to the other 2 zones. In this case I had 279 VMs scheduled to zone-3 (worker076-worker114).
I started the migration on Sun Aug 14 11:31:13 UTC 2022 and gave it some time to complete. However, 38 of the 279 VMs did not migrate out of the zone, because they were attempting to migrate within the zone itself, to cordoned nodes.
I ran this scale test before on 4.9.2, with OCP 4.9.15, and never had this issue, so I believe this is a new issue.
Note that these migrations were not triggered in parallel but serially, since this was not the actual test but only the preparation for it.
The following is a list of the VMs that got rescheduled to the cordoned nodes:
win10-vm0120
win10-vm0122
win10-vm0123
win10-vm0128
win10-vm0131
win10-vm0144
win10-vm0150
win10-vm0154
win10-vm0161
win10-vm0162
win10-vm0167
win10-vm0184
win10-vm0205
win10-vm0208
win10-vm0215
win10-vm0216
win10-vm0220
win10-vm0221
win10-vm0226
win10-vm0230
win10-vm0232
win10-vm0237
win10-vm0243
win10-vm0253
win10-vm0273
win10-vm0277
win10-vm0284
win10-vm0292
win10-vm0314
win10-vm0326
win10-vm0358
win10-vm0420
win10-vm0490
win10-vm0693
win10-vm0708
win10-vm0768
win10-vm0773
win10-vm0924
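A list like the one above could be produced with a check along these lines. This is a hedged sketch rather than the method actually used here: it assumes the inputs are the parsed JSON of `oc get nodes -o json` and `oc get vmi -o json`, and the function names are hypothetical. It flags every VMI whose current node is cordoned, so it does not distinguish a VM that never moved from one that migrated to another cordoned node.

```python
from typing import List, Set


def cordoned_nodes(node_list: dict) -> Set[str]:
    """Names of nodes marked unschedulable (spec.unschedulable is true)."""
    return {
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if node.get("spec", {}).get("unschedulable")
    }


def vmis_on_cordoned_nodes(vmi_list: dict, cordoned: Set[str]) -> List[str]:
    """Sorted names of VMIs whose status.nodeName is a cordoned node."""
    offenders = []
    for vmi in vmi_list.get("items", []):
        node = vmi.get("status", {}).get("nodeName")
        if node in cordoned:
            offenders.append(vmi["metadata"]["name"])
    return sorted(offenders)
```

In practice the two dictionaries would come from `json.loads` on the output of the two `oc get` commands, collected after the migrations have settled.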
must-gather:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather-migrating-to-cordon-node.tar.gz
VMs list:
=========
rhel82-vm0001 - rhel82-vm4000
win10-vm0001 - win10-vm1000