CNV-56659: VirtualMachineInstanceMigrations failing during upgrade under load



      Description of problem:

      VirtualMachineInstanceMigrations failing during upgrade under load

      Version-Release number of selected component (if applicable):

      4.18.0.rhel9-813

      How reproducible:

      Reproduced multiple times during the same workload update cycle.

      Steps to Reproduce:

      1. Deploy OCP/LSO/ODF/CNV 4.16
      2. Perform EUS upgrade with internal VM load
      3. Upgrade OCP/LSO/ODF/CNV to 4.18
      4. Enable LiveMigrate prior to unpausing the worker MCP (see the sketch after this list)
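      A minimal sketch of step 4 using the Python kubernetes client (assumed defaults: HyperConverged CR kubevirt-hyperconverged in openshift-cnv, MachineConfigPool worker; not the exact commands used during the run):

      # Step 4 sketch: enable LiveMigrate cluster-wide, then unpause the worker MCP.
      from kubernetes import client, config

      config.load_kube_config()
      api = client.CustomObjectsApi()

      # Set the cluster-wide eviction strategy on the HyperConverged CR.
      api.patch_namespaced_custom_object(
          group="hco.kubevirt.io",
          version="v1beta1",
          namespace="openshift-cnv",
          plural="hyperconvergeds",
          name="kubevirt-hyperconverged",
          body={"spec": {"evictionStrategy": "LiveMigrate"}},
      )

      # Unpause the worker MachineConfigPool so the paused node updates resume.
      api.patch_cluster_custom_object(
          group="machineconfiguration.openshift.io",
          version="v1",
          plural="machineconfigpools",
          name="worker",
          body={"spec": {"paused": False}},
      )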

      Actual results:

      After 24 hours, only 436 of the 2000 VMs had migrated successfully; the other 1564 VMs were still running the old virt-launcher.
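      For reference, a rough way to count which virt-launcher image the remaining VMs are on (a sketch with the Python kubernetes client; the kubevirt.io=virt-launcher label and the compute container name are KubeVirt defaults):

      # Group running virt-launcher pods by the image of their "compute" container.
      from collections import Counter
      from kubernetes import client, config

      config.load_kube_config()
      core = client.CoreV1Api()

      pods = core.list_pod_for_all_namespaces(label_selector="kubevirt.io=virt-launcher")
      images = Counter(
          c.image
          for pod in pods.items
          for c in pod.spec.containers
          if c.name == "compute"
      )
      for image, count in images.most_common():
          print(count, image)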
      
      After scaling back CNV and resetting the migrations, the migrations begin to fail again once the combined number of Pending and undefined (blank-phase) migrations approaches 500.
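      One way to tally the migration objects by phase, counting migrations with no status.phase yet as blank (a sketch; not necessarily how the numbers above were gathered):

      # Count VirtualMachineInstanceMigration objects per phase ("<blank>" = no phase set).
      from collections import Counter
      from kubernetes import client, config

      config.load_kube_config()
      api = client.CustomObjectsApi()

      vmims = api.list_cluster_custom_object(
          group="kubevirt.io",
          version="v1",
          plural="virtualmachineinstancemigrations",
      )
      phases = Counter(
          item.get("status", {}).get("phase") or "<blank>" for item in vmims["items"]
      )
      for phase, count in phases.most_common():
          print(count, phase)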
      
      
      Errors seen include:
      
      virt-controller:
      ...
      2025-02-15T14:31:51.482528488Z W0215 14:31:51.482514       1 shared_informer.go:597] resyncPeriod 5m0s is smaller than resyncCheckPeriod 13h27m2.088196035s and the informer has already started. Changing it to 13h27m2.088196035s
      ...
      2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"info","msg":"Looking for DataVolume Ref","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2283","timestamp":"2025-02-15T14:32:25.439454Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
      2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"error","msg":"Cant find the matching VM for DataVolume: vm-instancetype-cirros-test-0000-volume","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2294","timestamp":"2025-02-15T14:32:25.439525Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
      ...
      "Cant find the matching VM for DataVolume..." appears to be repeated for all VMs
       
      The rest of the log is filled with the following error, along with occasional PDB size increase/decrease messages:
      "patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request"
      2025-02-15T16:15:29.970377919Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance default/vm-instancetype-cirros-test-1522","pos":"vmi.go:253","reason":"patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request","timestamp":"2025-02-15T16:15:29.970260Z"} 
       
      virt-handler errors seen:
      2025-02-15T17:52:18.234412246Z {"component":"virt-handler","level":"error","msg":"failed to scrape metrics from //pods/77ef52ad-a310-4b0a-89dc-ad28ce92da55/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","pos":"scrapper.go:48","reason":"failed to connect to cmd client socket: context deadline exceeded","timestamp":"2025-02-15T17:52:18.225320Z"}
      ...
      2025-02-15T17:53:30.083204851Z {"component":"virt-handler","kind":"","level":"error","msg":"target migration listener is not up for this vmi","name":"vm-instancetype-cirros-test-1986","namespace":"default","pos":"vm.go:819","timestamp":"2025-02-15T17:53:30.083162Z","uid":"2946fe73-5b4f-40ac-8f74-02a74ae78152"}
      ...
      2025-02-15T16:31:29.977835795Z {"component":"virt-handler","level":"error","msg":"failed to scrape domain stats for VMI default/vm-instancetype-cirros-test-0668","pos":"queue.go:104","reason":"expected 1 value from DomainstatsScraper, got 0","timestamp":"2025-02-15T16:31:29.977796Z"} 
      
      
      

      Expected results:

      All migrations complete successfully.

      Additional info:

      Load within the VMs was very high, with heavy memory and disk activity.
      Reducing the VM CPU load made no difference.
      Increasing the OSD CPUs from 4 to 8 and then to 12 in response to OSDCPULoadHigh had no effect.

       
