OpenShift Virtualization / CNV-29308

[2210070] During node drain, non-migratable VMs may block other VMs from being migrated for a long time


    • CNV Virtualization Sprint 238, CNV Virtualization Sprint 239, CNV Virtualization Sprint 240
    • Important

      Description of problem:
      By default only 2 outbound migrations are allowed in parallel per node. If there are many non-migratable VMs on the cluster, they may block migratable VMs from being migrated while the node is being drained.
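
      For reference, a minimal sketch of where this limit can be inspected on an OpenShift Virtualization cluster; the HyperConverged name and namespace below are the usual defaults (kubevirt-hyperconverged / openshift-cnv) and may differ on other installs:

      > $ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
      >     -o jsonpath='{.spec.liveMigrationConfig}{"\n"}'
      > # The limits that apply are spec.configuration.migrations.parallelOutboundMigrationsPerNode
      > # (default 2) and parallelMigrationsPerCluster (default 5) in the KubeVirt CR:
      > $ oc get kubevirt -n openshift-cnv \
      >     -o jsonpath='{.items[0].spec.configuration.migrations}{"\n"}'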

      We have a fixed bug 2124528 for a similar scenario: a backoff was added to failed migrations, and it helps with manually started migrations (a VMIM in the Pending state does not affect a manually started migration). But the fix does not work well for automatically created migrations (i.e. kubevirt-evacuation and kubevirt-workload-update).
      For example, I have multiple VMs running on one node:

      > $ oc get vmi
      > NAME                AGE   PHASE     IP             NODENAME                            READY
      > vm-a                13h   Running   10.129.2.144   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-b                13h   Running   10.129.2.127   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-1   13h   Running   10.129.2.115   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-2   13h   Running   10.129.2.114   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-3   13h   Running   10.129.2.116   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-4   13h   Running   10.129.2.117   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-5   13h   Running   10.129.2.119   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-fedora-node1-6   13h   Running   10.129.2.118   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-y                13h   Running   10.129.2.142   virt-den-413-zj5v7-worker-0-mr7cc   True
      > vm-z                13h   Running   10.129.2.143   virt-den-413-zj5v7-worker-0-mr7cc   True

      vm-a (and vm-b, vm-y, vm-z) - migratable
      vm-fedora-node1-1 (2, 3, ...) - non-migratable because of a node selector
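
      For illustration only (the hostname label and the patch target below are assumptions, not taken from this cluster): a node selector of this kind pins a VM to the drained node, so the evacuation target pod can never be scheduled elsewhere and every migration attempt fails and accrues backoff:

      > $ oc patch vm vm-fedora-node1-1 --type merge \
      >     -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"virt-den-413-zj5v7-worker-0-mr7cc"}}}}}'
      > # (the VM has to be restarted for the selector to reach the running VMI)
      > # The selector can then be checked on the VMI:
      > $ oc get vmi vm-fedora-node1-1 -o jsonpath='{.spec.nodeSelector}{"\n"}'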

      After draining the node, if both VMIs selected for migration have a migration backoff, their VMIMs get stuck in the Pending state and do not allow other VMIs to create migrations:

      > $ oc get vmim | egrep -v "Failed|Succeed"
      > NAME                        PHASE     VMI
      > kubevirt-evacuation-4s6td   Pending   vm-fedora-node1-3
      > kubevirt-evacuation-77fr7   Pending   vm-fedora-node1-1

      ^^ These VMIMs are in the Pending state because of the backoff; they do nothing, just wait for the backoff timeout to expire. No other VMIs attempt to migrate.
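
      A sketch of how to confirm the two Pending migrations are only waiting on the backoff (the exact event wording depends on the KubeVirt build):

      > $ oc describe vmim kubevirt-evacuation-4s6td
      > $ oc get events --field-selector involvedObject.name=vm-fedora-node1-3 \
      >     --sort-by=.lastTimestamp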

      But if I start a migration manually, it is successfully created:
      > $ oc get vmim | egrep -v "Failed|Succeed"
      > NAME                        PHASE     VMI
      > kubevirt-evacuation-4s6td   Pending   vm-fedora-node1-3
      > kubevirt-evacuation-77fr7   Pending   vm-fedora-node1-1
      > kubevirt-evacuation-ftd25   Running   vm-z
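
      For completeness, one possible way to start such a manual migration (a sketch; either form should work, and the generateName below is just an arbitrary choice):

      > $ virtctl migrate vm-z
      > # or create the migration object directly:
      > $ cat <<EOF | oc create -f -
      > apiVersion: kubevirt.io/v1
      > kind: VirtualMachineInstanceMigration
      > metadata:
      >   generateName: manual-migration-vm-z-
      > spec:
      >   vmiName: vm-z
      > EOF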

      Version-Release number of selected component (if applicable):
      4.13

      How reproducible:
      Most of the time the migratable VMs are migrated in a reasonable amount of time, but in rare cases I've observed it taking up to 1 hour for a VM to be migrated.

      Steps to Reproduce:
      1. Run multiple VMs (migratable and non-migratable) on one node
      2. Drain the node (an example drain command is shown below)
      3.
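
      A sketch of the drain step (the flags are the usual ones for a node running VMs; adjust to the environment):

      > $ oc adm drain virt-den-413-zj5v7-worker-0-mr7cc \
      >     --ignore-daemonsets --delete-emptydir-data --force
      > # watch the evacuation migrations the drain triggers:
      > $ oc get vmim -w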

      Actual results:
      Migrations in the Pending state (waiting for the backoff timeout to expire) do not allow other VMs to be migrated because of the parallel migration limit.

      Expected results:
      Maybe we should consider Pending migrations as non-active, so that they do not count toward the total number of parallel migrations during the node drain?
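
      As a rough illustration of that distinction, this counts only the migrations that are actually making progress, i.e. everything except Pending, Succeeded and Failed (requires jq):

      > $ oc get vmim -o json | jq '[.items[]
      >     | select(.status.phase != "Pending"
      >              and .status.phase != "Succeeded"
      >              and .status.phase != "Failed")] | length'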

      Additional info:

      Assignee: Igor Bezukh (ibezukh)
      Reporter: Denys Shchedrivyi (dshchedr@redhat.com)