  OpenShift Virtualization / CNV-23025

[2149631] On upgrade, when live-migration fails due to an infra issue, virt-handler continuously and endlessly tries to migrate it


    • Priority: High

      +++ This bug was initially created as a clone of Bug #2124528 +++

      Description of problem:
      When upgrading CNV, all of the VirtualMachines in the cluster are live-migrated in order to update their virt-launchers.
      If a node hosting VMs has an issue that prevents migration to another node, because the migration-proxy cannot establish a connection between the target and source nodes, the target virt-launcher pod exits in an Error state.
      In that case, virt-handler tries to migrate the VM again and fails for the same reason.
      The default value of "parallelOutboundMigrationsPerNode" is 5, so the failed virt-launcher pods accumulate on the cluster at a rate of 5 every few minutes.
      If the root cause is not resolved, the number of pods in Error state can reach a few thousand within several hours, which might bring the cluster down due to the enormous number of etcd objects.
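
      The following is a minimal diagnostic sketch (not part of the original report) for spotting the node described above: it counts virt-launcher pods stuck in the Failed phase per node, using the official Python kubernetes client. The "kubevirt.io=virt-launcher" label is the standard virt-launcher pod label; everything else (kubeconfig loading, output format) is an assumption made for the example.

          # Diagnostic sketch: count failed virt-launcher pods per node.
          # Assumes the standard "kubevirt.io=virt-launcher" pod label and a
          # kubeconfig with cluster-wide pod read access.
          from collections import Counter
          from kubernetes import client, config

          config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
          v1 = client.CoreV1Api()

          failed_per_node = Counter()
          pods = v1.list_pod_for_all_namespaces(label_selector="kubevirt.io=virt-launcher")
          for pod in pods.items:
              # Pods displayed as "Error" by kubectl report phase "Failed" in the API.
              if pod.status.phase == "Failed":
                  failed_per_node[pod.spec.node_name or "<unscheduled>"] += 1

          for node, count in failed_per_node.most_common():
              print(f"{node}: {count} failed virt-launcher pods")

      A node whose count keeps growing by roughly "parallelOutboundMigrationsPerNode" every few minutes is the likely source of the failing migrations.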

      Version-Release number of selected component (if applicable):

      How reproducible:
      100%

      Steps to Reproduce:
      1. Have a running VMI on a node with networking issues.
      2. Complete an upgrade of CNV
      3.

      Actual results:
      Note the high number of Errored virt-launcher pods that keep accumulating endlessly.

      Expected results:
      KubeVirt should detect this situation, stop further migrations from the node in question, and raise a proper high-severity alert.

      Additional info:
      virt-handler pod logs from the node in question, captured while the issue was occurring, are attached.

      — Additional comment from on 2022-09-07 12:14:38 UTC —

      Looping endlessly is an intentional design pattern of Kubernetes, because it's a declarative system. If we were to halt the migration process, the upgrade couldn't finish. That's an even worse outcome.

      A greater concern to me is that garbage collection does not appear to be occurring. There should not be thousands of defunct pods as a result of this.

      — Additional comment from on 2022-09-07 12:20:17 UTC —

      Prioritizing this as urgent because it is unpleasant during an upgrade process, and if this situation occurs, there's no immediate/easy way to remediate the issue.

      — Additional comment from Antonio Cardace on 2022-09-15 10:09:12 UTC —

      @sgott@redhat.com We do garbage collection only on the migration objects, not on the target pods.

      We could use the same garbage collection mechanism for pods, although that is debatable, as those pods might contain useful information that a cluster admin may want to inspect to understand what's wrong with the cluster.
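
      To illustrate the trade-off discussed above, here is a hedged sketch (not KubeVirt's actual implementation) of what such pod garbage collection could look like: it deletes Failed virt-launcher pods but keeps the most recent few per node so a cluster admin can still inspect them. The retention count and the namespace handling are assumptions made for the example.

          # Sketch of a pod garbage-collection pass that keeps a few failed
          # virt-launcher pods per node for debugging and deletes the rest.
          from collections import defaultdict
          from kubernetes import client, config

          KEEP_PER_NODE = 5  # arbitrary retention count chosen for this sketch

          config.load_kube_config()
          v1 = client.CoreV1Api()

          failed_by_node = defaultdict(list)
          pods = v1.list_pod_for_all_namespaces(label_selector="kubevirt.io=virt-launcher")
          for pod in pods.items:
              if pod.status.phase == "Failed":
                  failed_by_node[pod.spec.node_name].append(pod)

          for node, failed in failed_by_node.items():
              # Newest first, so the most recent failures are kept for inspection.
              failed.sort(key=lambda p: p.metadata.creation_timestamp, reverse=True)
              for pod in failed[KEEP_PER_NODE:]:
                  v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
                  print(f"deleted {pod.metadata.namespace}/{pod.metadata.name} (node {node})")

      An approach like this could also serve as a manual remediation for the pod pile-up described in this bug.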

      — Additional comment from Antonio Cardace on 2022-11-18 16:11:33 UTC —

      Deferring to 4.12.1 because we're past blockers-only and this is not considered a blocker for 4.12.

      — Additional comment from Kedar Bidarkar on 2022-11-30 13:00:27 UTC —

      Debarati/IUO team suggested this would be a blocker bug as it is causing issues with upgrades from time to time.

      As the PR/fix is already in release-0.58.0, we feel we should target this bug to the 4.12.0 release itself.

      — Additional comment from on 2022-11-30 13:04:27 UTC —

      Per Comment #5, moving this to 4.12.0

            Assignee: Jed Lejosne (jelejosne)
            Reporter: Kedar Bidarkar (kbidarka@redhat.com)
            QA Contact: Kedar Bidarkar