OpenShift Virtualization / CNV-20450

[2118147] Migrating VMs get scheduled to cordoned nodes and fail the migration testing



      I'm running a scale regression setup on:
      =========================================
      OCP 4.10.20
      OpenShift Virtualization 4.10.3

      This is a large-scale setup with 117 nodes and 5000 persistent VMs. Before I start my migration testing, I make sure the same number of VMs goes through migration; I do that by dividing the nodes into zones:
      ====================================================
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      zone-0   rendered-zone-0-8bb922c85674a178fad84d87b091f66f   True      False      False      38             38                  38                    0                      4d
      zone-1   rendered-zone-1-0647fa88f407dd22b944c5e27243e2c1   True      False      False      38             38                  38                    0                      4d
      zone-2   rendered-zone-2-0647fa88f407dd22b944c5e27243e2c1   False     True       False      39             39                  39                    0                      4d
      ====================================================

      I cordon/uncordon according to the current scheduling map, meaning that if a zone has more VMs than I require, I cordon all the nodes in that zone and migrate the excess VMs using "virtctl migrate", so that they get migrated to one of the other two zones.
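
The cordon-then-migrate preparation described above can be sketched roughly as follows. This is a sketch only: the zone node selector and the VM names are hypothetical placeholders, not the labels used in the actual setup, and the wait loop relies on the VMI's `status.migrationState.completed` field that KubeVirt sets when a live migration finishes.

```shell
# Sketch of the preparation step: cordon every node in one zone, then
# serially migrate the excess VMs off it. ZONE_SELECTOR and the VM names
# below are assumed placeholders.

ZONE_SELECTOR="custom-kubelet=zone-2"   # hypothetical node label for the zone

drain_zone_serially() {
  # Cordon all nodes in the zone so the scheduler stops placing pods there.
  for node in $(oc get nodes -l "$ZONE_SELECTOR" -o name); do
    oc adm cordon "$node"
  done

  # Migrate one VM at a time, waiting for each live migration to complete
  # before triggering the next one (serial, as in the report).
  for vm in win10-vm0001 win10-vm0002; do
    virtctl migrate "$vm"
    until [ "$(oc get vmi "$vm" -o jsonpath='{.status.migrationState.completed}')" = "true" ]; do
      sleep 10
    done
  done
}

# Only run against a real cluster; otherwise this file is just a sketch.
if command -v oc >/dev/null 2>&1 && command -v virtctl >/dev/null 2>&1; then
  drain_zone_serially
else
  echo "oc/virtctl not found; nothing to do"
fi
```

With all nodes in the source zone cordoned, the scheduler should only be able to place the migration target pods on the other two zones, which is what makes the behavior reported below surprising.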

      In this specific scenario I needed a homogeneous zone with only RHEL VMs, so I tried to migrate all the Windows VMs to the other two zones; in this specific case I had 279 VMs scheduled to zone-2 (worker076-worker114).

      I started the migration on Sun Aug 14 11:31:13 UTC 2022 and gave it some time to complete; however, 38 of the 279 VMs did not migrate, because they were attempting to migrate within the zone itself, to cordoned nodes.
      I ran this scale test before on 4.9.2, with OCP 4.9.15, and I never had this issue, so I believe this is a new issue.
      Note: those migrations were triggered serially, not in parallel, since this was not the actual test but just the preparation for it.

      The following is a list of the VMs that got rescheduled to the cordoned nodes:

      win10-vm0120
      win10-vm0122
      win10-vm0123
      win10-vm0128
      win10-vm0131
      win10-vm0144
      win10-vm0150
      win10-vm0154
      win10-vm0161
      win10-vm0162
      win10-vm0167
      win10-vm0184
      win10-vm0205
      win10-vm0208
      win10-vm0215
      win10-vm0216
      win10-vm0220
      win10-vm0221
      win10-vm0226
      win10-vm0230
      win10-vm0232
      win10-vm0237
      win10-vm0243
      win10-vm0253
      win10-vm0273
      win10-vm0277
      win10-vm0284
      win10-vm0292
      win10-vm0314
      win10-vm0326
      win10-vm0358
      win10-vm0420
      win10-vm0490
      win10-vm0693
      win10-vm0708
      win10-vm0768
      win10-vm0773
      win10-vm0924

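A list like the one above can be regenerated from the cluster by cross-checking each running VMI's node against the set of cordoned nodes. A minimal sketch, assuming cluster-wide read access; cordoned nodes are identified by `spec.unschedulable=true`, and a VMI records its node in `status.nodeName`:

```shell
# Report every VMI currently running on a cordoned node.

report_vmis_on_cordoned_nodes() {
  # Names of all cordoned nodes (spec.unschedulable == true).
  cordoned=$(oc get nodes \
    -o jsonpath='{range .items[?(@.spec.unschedulable==true)]}{.metadata.name}{"\n"}{end}')

  # "name node" pairs for every VMI, filtered against the cordoned set.
  oc get vmi --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeName}{"\n"}{end}' |
  while read -r vmi node; do
    if printf '%s\n' "$cordoned" | grep -qx "$node"; then
      echo "$vmi -> $node (cordoned)"
    fi
  done
}

if command -v oc >/dev/null 2>&1; then
  report_vmis_on_cordoned_nodes
else
  echo "oc not found; sketch only"
fi
```
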
      must-gather:
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather-migrating-to-cordon-node.tar.gz

      VMs list:
      =========
      rhel82-vm0001 - rhel82-vm4000
      win10-vm0001 - win10-vm1000

              Dominik Holler (dholler@redhat.com)
              Boaz Ben Shabat (bbenshab@redhat.com)
              Guy Chen
              Votes: 3
              Watchers: 8
