Bug
Resolution: Unresolved
Critical
CNV v4.18.0
Description of problem:
VirtualMachineInstanceMigrations failing during upgrade under load
Version-Release number of selected component (if applicable):
4.18.0.rhel9-813
How reproducible:
Reproduced multiple times during the same workload update cycle.
Steps to Reproduce:
1. Deploy OCP/LSO/ODF/CNV 4.16
2. Perform EUS upgrade with internal VM load
3. Upgrade OCP/LSO/ODF/CNV to 4.18
4. Enable LiveMigrate prior to unpausing the worker MCP (see the command sketch below)
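A minimal sketch of step 4, assuming the cluster-wide eviction strategy is set on the HyperConverged CR (kubevirt-hyperconverged in openshift-cnv) and that the worker MachineConfigPool was paused for the EUS upgrade; the resource names and fields below are assumptions, not taken from this report:

  # Assumed: eviction strategy is configured cluster-wide via the HyperConverged CR
  oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
    --type merge -p '{"spec":{"evictionStrategy":"LiveMigrate"}}'

  # Unpause the worker MCP so the 4.18 MachineConfig rollout (and node drains) can proceed
  oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":false}}'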
Actual results:
After 24h, only 436 VMs have migrated successfully; the other 1564 VMs are still running the old virt-launcher. After scaling back CNV and resetting the migrations, the migrations begin to fail again once the combined number of Pending and Undefined (blank) migrations approaches 500. Errors seen include:

virt-controller:

2025-02-15T14:31:51.482528488Z W0215 14:31:51.482514 1 shared_informer.go:597] resyncPeriod 5m0s is smaller than resyncCheckPeriod 13h27m2.088196035s and the informer has already started. Changing it to 13h27m2.088196035s
...
2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"info","msg":"Looking for DataVolume Ref","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2283","timestamp":"2025-02-15T14:32:25.439454Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"error","msg":"Cant find the matching VM for DataVolume: vm-instancetype-cirros-test-0000-volume","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2294","timestamp":"2025-02-15T14:32:25.439525Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
...

"Cant find the matching VM for DataVolume..." appears to be repeated for all VMs. The rest of the log is filled with this error, along with the occasional PDB size increase/decrease.

"patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request":

2025-02-15T16:15:29.970377919Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance default/vm-instancetype-cirros-test-1522","pos":"vmi.go:253","reason":"patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request","timestamp":"2025-02-15T16:15:29.970260Z"}

virt-handler errors seen:

2025-02-15T17:52:18.234412246Z {"component":"virt-handler","level":"error","msg":"failed to scrape metrics from //pods/77ef52ad-a310-4b0a-89dc-ad28ce92da55/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","pos":"scrapper.go:48","reason":"failed to connect to cmd client socket: context deadline exceeded","timestamp":"2025-02-15T17:52:18.225320Z"}
...
2025-02-15T17:53:30.083204851Z {"component":"virt-handler","kind":"","level":"error","msg":"target migration listener is not up for this vmi","name":"vm-instancetype-cirros-test-1986","namespace":"default","pos":"vm.go:819","timestamp":"2025-02-15T17:53:30.083162Z","uid":"2946fe73-5b4f-40ac-8f74-02a74ae78152"}
...
2025-02-15T16:31:29.977835795Z {"component":"virt-handler","level":"error","msg":"failed to scrape domain stats for VMI default/vm-instancetype-cirros-test-0668","pos":"queue.go:104","reason":"expected 1 value from DomainstatsScraper, got 0","timestamp":"2025-02-15T16:31:29.977796Z"}
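For reference, the Pending/blank-phase counts above can be tallied with a sketch like the one below; it assumes the vmim short name for VirtualMachineInstanceMigrations and jq being available, and is illustrative rather than part of the original reproduction:

  # Count VirtualMachineInstanceMigrations per phase; objects with no phase show up as "Blank"
  oc get vmim -n default -o json \
    | jq -r '.items[].status.phase // "Blank"' \
    | sort | uniq -c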
Expected results:
All migrations successful
Additional info:
Load within the VMs was very high, along with heavy memory and disk activity. Reducing the VM CPU load made no difference. Increasing the OSD CPUs from 4 to 8 and then to 12 in response to OSDCPULoadHigh also had no effect (see the sketch below).
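One way such an OSD CPU bump could be applied is through the ODF StorageCluster CR; a hedged sketch, assuming the default ocs-storagecluster/openshift-storage names and that OSD resources are carried on the first storageDeviceSets entry (all of these are assumptions, not confirmed by this report):

  # Assumed names: StorageCluster ocs-storagecluster in openshift-storage;
  # OSD CPU requests/limits assumed to live on storageDeviceSets[0].resources
  oc patch storagecluster ocs-storagecluster -n openshift-storage --type json -p '[
    {"op":"add","path":"/spec/storageDeviceSets/0/resources","value":{"requests":{"cpu":"12"},"limits":{"cpu":"12"}}}
  ]'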
- blocks: CNV-54728 Performance upgrade testing for 4.18.0 (In Progress)
- clones: CNV-56658 VirtualMachineInstanceMigrations with undefined/blank phase during upgrade under load (New)
- is related to: CNV-56912 The metric for kubevirt_vmi_migration_failed include VMIMs in 'undefined' status (New)
- is triggered by: CNV-54728 Performance upgrade testing for 4.18.0 (In Progress)