-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
False
-
-
False
-
CLOSED
-
---
-
---
-
CNV Virtualization Sprint 224, CNV Virtualization Sprint 225, CNV Virtualization Sprint 226, CNV Virtualization Sprint 227, CNV Virtualization Sprint 228, CNV Virtualization Sprint 229
-
High
-
None
Description of problem:
When upgrading CNV, all VirtualMachines in the cluster are live-migrated in order to update their virt-launchers.
If an issue on a node hosting VMs prevents migration from it to another node (for example, the migration-proxy cannot establish a connection between the source and target nodes), the target virt-launcher pod exits in an Error state.
In that case, virt-handler tries to migrate the VMI again and fails for the same reason.
The default value of "parallelOutboundMigrationsPerNode" is 5, so the failed virt-launcher pods accumulate on the cluster at a rate of 5 every few minutes.
If the root cause is not resolved, the number of pods in Error state can reach a few thousand within several hours, which might bring the cluster down due to the enormous number of etcd objects.
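For reference, "parallelOutboundMigrationsPerNode" lives in the KubeVirt migration configuration; a sketch of lowering it to slow the accumulation (the namespace and value here are illustrative, and in a CNV deployment this setting is typically managed through the HyperConverged CR rather than edited directly):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt   # assumption: adjust to the deployment's namespace
spec:
  configuration:
    migrations:
      # Default is 5; each failed outbound migration leaves an Errored
      # target virt-launcher pod behind, so a lower cap slows the leak.
      parallelOutboundMigrationsPerNode: 2
```

This only throttles the rate of failed attempts; it does not stop the retry loop itself.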
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Have a running VMI on a node with networking issues.
2. Complete an upgrade of CNV
3.
Actual results:
A high number of virt-launcher pods in Error state keep accumulating endlessly.
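The accumulation can be quantified with a sketch like the following (namespace is an assumption; the awk filter is shown running over sample `kubectl get pods` output so the filtering logic itself is verifiable):

```shell
# On a live cluster (hypothetical namespace):
#   kubectl get pods -n openshift-cnv --no-headers \
#     | awk '$3 == "Error" && $1 ~ /^virt-launcher/' | wc -l

# The same filter exercised on sample "kubectl get pods" output:
sample='virt-launcher-vm-a-abcde 0/1 Error   0 5m
virt-launcher-vm-b-fghij 1/1 Running 0 5m
virt-handler-xyz12       1/1 Running 0 5m'

# Column 3 is STATUS; keep only Errored virt-launcher pods and count them.
echo "$sample" | awk '$3 == "Error" && $1 ~ /^virt-launcher/' | wc -l
```

Running this periodically shows the count growing by roughly parallelOutboundMigrationsPerNode per retry cycle.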
Expected results:
KubeVirt should detect this condition, stop further migrations from the node in question, and raise a proper high-severity alert.
Additional info:
virt-handler pod logs from the node in question while the issue is occurring are attached.
- blocks
-
CNV-23025 [2149631] On upgrade, when live-migration is failed due to an infra issue, virt-handler continuously and endlessly tries to migrate it
- Closed
- external trackers