- Bug
- Resolution: Done-Errata
- Critical
- CNV v4.18.0
- Quality / Stability / Reliability
- 0.42
- False
- False
- CNV v4.18.4.rhel9-5
- Release Notes
- Known Issue
- Done
- Critical
- Proposed
- None
Description of problem:
VirtualMachineInstanceMigrations failing during upgrade under load
Version-Release number of selected component (if applicable):
4.18.0.rhel9-813
How reproducible:
Reproduced multiple times during the same workload update cycle.
Steps to Reproduce:
1. Deploy OCP/LSO/ODF/CNV 4.16
2. Perform EUS upgrade with internal VM load
3. Upgrade OCP/LSO/ODF/CNV to 4.18
4. Enable LiveMigrate prior to unpausing worker MCP
Actual results:
After 24h, only 436 VMs have migrated successfully. The other 1564 VMs still have the old virt-launcher.
After scaling back CNV and resetting the migrations, the migrations begin to fail again as the combined number of Pending and Undefined (blank phase) migrations approaches 500.
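To watch the backlog described above, the migrations can be counted by phase. The following is a minimal sketch, not part of the original report. It assumes the migration objects were exported as a JSON list (for example with `oc get vmim -A -o json`) and that a missing or empty `status.phase` corresponds to the "Undefined (blank)" state. The names `count_phases`, `backlog`, and `BACKLOG_THRESHOLD` are hypothetical, and the value 500 is only the rough figure observed in this report, not a documented limit.

```python
import json
from collections import Counter

# Rough figure observed in this report; an assumption, not a documented limit.
BACKLOG_THRESHOLD = 500


def count_phases(vmim_list):
    """Count migration objects by status.phase.

    A missing or empty phase is counted as "Undefined" (assumed to match
    the blank phase described in the report).
    """
    counts = Counter()
    for item in vmim_list.get("items", []):
        phase = (item.get("status") or {}).get("phase") or "Undefined"
        counts[phase] += 1
    return counts


def backlog(counts):
    """Combined number of Pending and Undefined migrations."""
    return counts.get("Pending", 0) + counts.get("Undefined", 0)


if __name__ == "__main__":
    # Tiny hand-written sample standing in for the exported JSON list.
    sample = json.loads(
        '{"items": [{"status": {"phase": "Pending"}},'
        ' {"status": {}},'
        ' {"status": {"phase": "Succeeded"}}]}'
    )
    counts = count_phases(sample)
    print(dict(counts))
    print("backlog:", backlog(counts), "threshold:", BACKLOG_THRESHOLD)
```

In practice the sample would be replaced with the parsed output of the export command, and the backlog value compared against the threshold over time.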
Errors seen include:
virt-controller:
...
2025-02-15T14:31:51.482528488Z W0215 14:31:51.482514       1 shared_informer.go:597] resyncPeriod 5m0s is smaller than resyncCheckPeriod 13h27m2.088196035s and the informer has already started. Changing it to 13h27m2.088196035s
...
2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"info","msg":"Looking for DataVolume Ref","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2283","timestamp":"2025-02-15T14:32:25.439454Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"error","msg":"Cant find the matching VM for DataVolume: vm-instancetype-cirros-test-0000-volume","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2294","timestamp":"2025-02-15T14:32:25.439525Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
...
"Cant find the matching VM for DataVolume..." appears to be repeated for all VMs
The rest of the log is filled with the following error, along with the occasional PDB size increase/decrease:
"patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request"
2025-02-15T16:15:29.970377919Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance default/vm-instancetype-cirros-test-1522","pos":"vmi.go:253","reason":"patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request","timestamp":"2025-02-15T16:15:29.970260Z"}
virt-handler errors seen:
2025-02-15T17:52:18.234412246Z {"component":"virt-handler","level":"error","msg":"failed to scrape metrics from //pods/77ef52ad-a310-4b0a-89dc-ad28ce92da55/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","pos":"scrapper.go:48","reason":"failed to connect to cmd client socket: context deadline exceeded","timestamp":"2025-02-15T17:52:18.225320Z"}
...
2025-02-15T17:53:30.083204851Z {"component":"virt-handler","kind":"","level":"error","msg":"target migration listener is not up for this vmi","name":"vm-instancetype-cirros-test-1986","namespace":"default","pos":"vm.go:819","timestamp":"2025-02-15T17:53:30.083162Z","uid":"2946fe73-5b4f-40ac-8f74-02a74ae78152"}
...
2025-02-15T16:31:29.977835795Z {"component":"virt-handler","level":"error","msg":"failed to scrape domain stats for VMI default/vm-instancetype-cirros-test-0668","pos":"queue.go:104","reason":"expected 1 value from DomainstatsScraper, got 0","timestamp":"2025-02-15T16:31:29.977796Z"}
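Because the same few messages repeat for every VM, it can help to tally log entries by source position rather than read them line by line. The sketch below is an illustration only. It assumes each log line has the shape shown in the excerpts above (a timestamp, a space, then either a JSON object or a klog-style text line), and it groups on the `component` and `pos` fields because the `msg` text embeds VM names. The function names `parse_line` and `tally` are hypothetical.

```python
import json
from collections import Counter


def parse_line(line):
    """Return the JSON payload of a log line, or None for non-JSON lines.

    Lines are assumed to look like "<timestamp> <payload>"; klog-style
    payloads (for example those starting with "W0215") are skipped.
    """
    _, _, payload = line.strip().partition(" ")
    if not payload.startswith("{"):
        return None
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


def tally(lines, level="error"):
    """Count entries of the given level, keyed by (component, pos)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry.get("level") == level:
            counts[(entry.get("component"), entry.get("pos"))] += 1
    return counts


if __name__ == "__main__":
    # Shortened stand-ins for the log lines quoted in this report.
    demo = [
        '2025-02-15T14:31:51.482528488Z W0215 14:31:51.482514 1 shared_informer.go:597] resyncPeriod',
        '2025-02-15T14:32:25.439656573Z {"component":"virt-controller","level":"error","pos":"vm.go:2294"}',
        '2025-02-15T17:53:30.083204851Z {"component":"virt-handler","level":"error","pos":"vm.go:819"}',
    ]
    for key, count in tally(demo).most_common():
        print(key, count)
```

The "reenqueuing VirtualMachineInstance" entries in the excerpts are logged at info level, so counting them would need `tally(lines, level="info")`.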
Expected results:
All migrations successful
Additional info:
Load within the VMs was very high, along with memory and disk activity. Reducing the VM CPU load made no difference. Increasing the OSD CPUs from 4 to 8 and then to 12 in response to OSDCPULoadHigh also had no effect.
- blocks
  - CNV-54728 Performance upgrade testing for 4.18.0 (Closed)
- clones
  - CNV-56658 VirtualMachineInstanceMigrations with undefined/blank phase during upgrade under load (Closed)
- is cloned by
  - CNV-59995 [CNV-4.17] Upgrade/Migration fails when combined pending and undefined migrations approaches 400 (POST)
  - CNV-56998 RN: Known issue - VirtualMachineInstanceMigrations failing when combined pending and undefined migrations approaches 400 (Closed)
  - CNV-59994 [CNV-4.19] Upgrade/Migration fails when combined pending and undefined migrations approaches 400 (Closed)
  - CNV-59996 [CNV-4.16] Upgrade/Migration fails when combined pending and undefined migrations approaches 400 (Closed)
- is related to
  - CNV-30386 [2218435] queueing multiple VMs migration causes virt-controller to hit a deadlock. (POST)
  - CNV-56912 The metric for kubevirt_vmi_migration_failed include VMIMs in 'undefined' status (Closed)
- is triggered by
  - CNV-54728 Performance upgrade testing for 4.18.0 (Closed)
- is triggering
  - CNV-59991 virt-controller excessive info-level logging during migrations (New)
- links to
  - RHEA-2025:152505 OpenShift Virtualization 4.18.12 Images