-
Bug
-
Resolution: Done
-
Major
-
None
-
5
-
False
-
-
False
-
CLOSED
-
CNV Virtualization Sprint 220, CNV Virtualization Sprint 221, CNV Virtualization Sprint 222, CNV Virtualization Sprint 231, CNV Virtualization Sprint 232
-
High
-
None
Some background:
-------------------------
I'm running a scale OpenShift setup with 100 OpenShift nodes, in preparation for an environment requested by a customer, with 47 RHCS 5.0 hosts as an external storage cluster.
This setup is currently running 3000 VMs:
1500 RHEL 8.5 persistent-storage VMs
500 Windows 10 persistent-storage VMs
1000 Fedora ephemeral-storage VMs
The workers are divided into 3 zones:
worker000 - worker031 = Zone0
worker032 - worker062 = Zone1
worker063 - worker096 = Zone2
I start the migration by applying an empty MachineConfig to a zone, which then causes its nodes to start draining.
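For illustration, the empty-MachineConfig trigger can be sketched roughly as follows; the object name, the role label, and the Ignition version are assumptions, not taken from the actual setup:

```shell
# Hypothetical sketch: an "empty" MachineConfig (no file/unit changes) whose
# application forces the MCO to roll the targeted pool, draining its nodes.
# Name and role label below are assumptions.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-zone0-empty-mc
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
EOF
```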
This is my migration config:
liveMigrationConfig:
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 11
  parallelOutboundMigrationsPerNode: 22
  progressTimeout: 150
workloads: {}
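For context, this block lives under the HyperConverged CR's spec; one way to apply it would be a merge patch like the following (the CR name and namespace are assumed to be the CNV defaults):

```shell
# Hypothetical sketch: setting the liveMigrationConfig shown above on the
# HyperConverged CR. CR name and namespace are assumed CNV defaults.
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
  --type merge -p '{"spec":{"liveMigrationConfig":{
    "completionTimeoutPerGiB": 800,
    "parallelMigrationsPerCluster": 11,
    "parallelOutboundMigrationsPerNode": 22,
    "progressTimeout": 150}}}'
```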
Another thing worth mentioning is that I'm running a custom KubeletConfig, required because of the additional 21,400 pods on the cluster:
spec:
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
    maxPods: 500
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
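A quick way to confirm the custom kubelet settings actually rolled out might look like this (the node name and the on-host config path are assumptions):

```shell
# Hypothetical sketch: verifying the KubeletConfig rollout.
oc get kubeletconfig
oc get mcp -l custom-kubelet=enabled
# Spot-check one node's effective maxPods (node name and path assumed).
oc debug node/worker000 -- chroot /host grep maxPods /etc/kubernetes/kubelet.conf
```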
Issue number 1:
The first problem I encountered was that, right after starting the migration, it got stuck for a few hours and nothing happened.
I also tried to manually run virtctl migrate on a few of the VMs that were scheduled on cordoned nodes, and the CLI was failing due to timeouts.
I resolved that by patching virt-api to run additional pods; this issue is already discussed at https://github.com/kubevirt/kubevirt/issues/7101
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  annotations:
    deployOVS: "false"
    kubevirt.kubevirt.io/jsonpatch: '[{"op": "add", "path": "/spec/customizeComponents/patches",
      "value": [{"resourceType": "Deployment", "resourceName": "virt-api", "type":
      "json", "patch": "[
      ]"}]}]'
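After applying a patch like this, the extra virt-api replicas can be confirmed with something along these lines (the namespace and pod label are assumptions based on a default CNV install):

```shell
# Hypothetical sketch: confirming virt-api was scaled up by the jsonpatch.
oc get deployment virt-api -n openshift-cnv -o jsonpath='{.spec.replicas}{"\n"}'
oc get pods -n openshift-cnv -l kubevirt.io=virt-api
```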
Issue number 2:
Once the migration started running I hoped that was it; however, a few VMs are currently failing to migrate for various reasons. These are the VMs:
rhel82-vm0074 3d23h Migrating True
rhel82-vm0188 3d22h Migrating True
rhel82-vm0253 3d21h Migrating True
rhel82-vm0443 3d19h Migrating True
rhel82-vm0451 3d19h Migrating True
rhel82-vm0611 3d18h Migrating True
rhel82-vm0784 3d17h Migrating True
rhel82-vm1184 3d14h Migrating True
rhel82-vm1428 3d12h Migrating True
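The list above was compiled from the VMI status; migrations stuck like this can be surfaced with something like the following (the greps are a sketch, not an exact reproduction of how the list was generated):

```shell
# Hypothetical sketch: surfacing VMIs stuck migrating and their
# VirtualMachineInstanceMigration objects.
oc get vmi -A | grep Migrating
oc get vmim -A --no-headers | grep -v Succeeded
```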
here are a few examples:
VM rhel82-vm0451 - running on worker031, failing due to an assertion in kvm_buf_set_msrs
---------------------------------------
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 148m disruptionbudget-controller Created Migration kubevirt-evacuation-mvr7r
Normal PreparingTarget 143m virt-handler Migration Target is listening at 10.131.44.5, on ports: 36763,37373
Normal PreparingTarget 143m (x24 over 11h) virt-handler VirtualMachineInstance Migration Target Prepared.
Warning Migrated 143m virt-handler VirtualMachineInstance migration uid b9fd0b54-26a5-4063-bd46-7b1e5dbeddd5 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
Normal SuccessfulCreate 89m disruptionbudget-controller Created Migration kubevirt-evacuation-nvjbk
Normal PreparingTarget 85m virt-handler Migration Target is listening at 10.130.2.5, on ports: 37595,32775
Normal PreparingTarget 85m (x12 over 10h) virt-handler VirtualMachineInstance Migration Target Prepared.
Warning Migrated 80m virt-handler VirtualMachineInstance migration uid 7c9f1f23-43e3-4af7-b423-6b0088cf563f failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
Normal SuccessfulCreate 35m disruptionbudget-controller Created Migration kubevirt-evacuation-7fstf
Normal PreparingTarget 33m virt-handler Migration Target is listening at 10.128.44.6, on ports: 38759,34661
Warning Migrated 27m virt-handler VirtualMachineInstance migration uid 5b8e1b88-1b03-4668-9e1a-755989f7c868 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T08:48:41.882103Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
Normal SuccessfulCreate 27m disruptionbudget-controller Created Migration kubevirt-evacuation-ts6nn
Normal PreparingTarget 23m virt-handler Migration Target is listening at 10.128.44.6, on ports: 41015,35881
Warning Migrated 17m virt-handler VirtualMachineInstance migration uid 74324d10-c606-41b2-8c8c-baeb94ccaa04 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
Normal SuccessfulCreate 14m disruptionbudget-controller Created Migration kubevirt-evacuation-6sqfb
Normal SuccessfulUpdate 13m (x32 over 11h) virtualmachine-controller Expanded PodDisruptionBudget kubevirt-disruption-budget-cnqj5
Normal PreparingTarget 8m53s (x2 over 8m53s) virt-handler Migration Target is listening at 10.128.44.6, on ports: 46429,34129
Normal PreparingTarget 8m52s (x13 over 33m) virt-handler VirtualMachineInstance Migration Target Prepared.
Normal Migrating 8m52s (x116 over 11h) virt-handler VirtualMachineInstance is migrating.
Normal SuccessfulUpdate 7m47s (x32 over 11h) disruptionbudget-controller shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-cnqj5)
Warning SyncFailed 3m40s (x32 over 10h) virt-handler server error. command Migrate failed: "migration job already executed"
Warning Migrated 3m40s virt-handler VirtualMachineInstance migration uid ed604e03-8658-4739-9875-95b88f2e0dd0 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
VM rhel82-vm0660 - running on worker031, failing due to what seems to be a race condition
---------------------------------------
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 151m disruptionbudget-controller Created Migration kubevirt-evacuation-6zlsd
Normal PreparingTarget 145m virt-handler Migration Target is listening at 10.131.0.7, on ports: 45093,37935
Normal PreparingTarget 145m (x12 over 7h47m) virt-handler VirtualMachineInstance Migration Target Prepared.
Warning Migrated 140m virt-handler VirtualMachineInstance migration uid 09832eb9-bed5-403c-9020-3e2f586a41e7 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
Normal SuccessfulCreate 92m disruptionbudget-controller Created Migration kubevirt-evacuation-n8267
Normal SuccessfulUpdate 91m (x11 over 11h) virtualmachine-controller Expanded PodDisruptionBudget kubevirt-disruption-budget-twdcv
Normal PreparingTarget 88m virt-handler Migration Target is listening at 10.128.4.6, on ports: 39099,40877
Normal Migrating 88m (x35 over 11h) virt-handler VirtualMachineInstance is migrating.
Normal PreparingTarget 88m (x8 over 8h) virt-handler VirtualMachineInstance Migration Target Prepared.
Normal SuccessfulUpdate 87m (x11 over 11h) disruptionbudget-controller shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-twdcv)
Warning SyncFailed 83m (x11 over 11h) virt-handler server error. command Migrate failed: "migration job already executed"
Warning Migrated 83m virt-handler VirtualMachineInstance migration uid fb9e3a06-21ab-4e54-8e6e-861f44bbee36 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
---------------------------------------
VM rhel82-vm0836 - running on worker024, failing due to what seems to be a race condition, plus failed API calls.
---------------------------------------
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 168m disruptionbudget-controller Created Migration kubevirt-evacuation-ddrdp
Normal PreparingTarget 165m virt-handler Migration Target is listening at 10.131.0.7, on ports: 46719,42575
Normal PreparingTarget 165m (x12 over 7h43m) virt-handler VirtualMachineInstance Migration Target Prepared.
Warning Migrated 159m virt-handler VirtualMachineInstance migration uid 61b4e7c8-1348-4bac-9768-6a6be9c129e0 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
Normal SuccessfulCreate 140m disruptionbudget-controller Created Migration kubevirt-evacuation-pnbpm
Normal PreparingTarget 135m (x9 over 3h34m) virt-handler VirtualMachineInstance Migration Target Prepared.
Normal PreparingTarget 135m (x2 over 135m) virt-handler Migration Target is listening at 10.130.30.5, on ports: 35207,37223
Warning Migrated 129m virt-handler VirtualMachineInstance migration uid acca7d9e-723e-4f3c-adad-97d432db3a1b failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T07:18:04.563275Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
Normal SuccessfulCreate 124m disruptionbudget-controller Created Migration kubevirt-evacuation-frtqg
Normal PreparingTarget 120m virt-handler Migration Target is listening at 10.131.44.5, on ports: 45403,37683
Normal PreparingTarget 120m (x8 over 8h) virt-handler VirtualMachineInstance Migration Target Prepared.
Warning Migrated 120m virt-handler VirtualMachineInstance migration uid 0f6f18b5-3fde-4beb-8a9c-c112b7f8da02 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
Normal SuccessfulCreate 107m disruptionbudget-controller Created Migration kubevirt-evacuation-wsjrc
Normal PreparingTarget 102m (x17 over 11h) virt-handler VirtualMachineInstance Migration Target Prepared.
Normal PreparingTarget 102m virt-handler Migration Target is listening at 10.130.2.5, on ports: 35511,35451
Warning Migrated 95m virt-handler VirtualMachineInstance migration uid ab614074-848a-4ee4-8c37-07d4fcfbd872 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T07:51:10.157656Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
Normal SuccessfulCreate 11m disruptionbudget-controller Created Migration kubevirt-evacuation-zv85q
Normal SuccessfulUpdate 10m (x38 over 16h) virtualmachine-controller Expanded PodDisruptionBudget kubevirt-disruption-budget-ln2kq
Normal PreparingTarget 6m39s virt-handler Migration Target is listening at 10.128.44.6, on ports: 42829,45929
Normal Migrating 6m38s (x126 over 16h) virt-handler VirtualMachineInstance is migrating.
Normal PreparingTarget 6m38s (x4 over 6m39s) virt-handler VirtualMachineInstance Migration Target Prepared.
Normal SuccessfulUpdate 5m14s (x38 over 16h) disruptionbudget-controller shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-ln2kq)
Warning SyncFailed 95s (x38 over 16h) virt-handler server error. command Migrate failed: "migration job already executed"
Warning Migrated 95s virt-handler VirtualMachineInstance migration uid 1d146437-b031-47f7-accb-9ca42e960025 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
---------------------------------------
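The event streams above were collected per VM; they can be reproduced with something like the following (the namespace is an assumption):

```shell
# Hypothetical sketch: dumping the migration events for one failing VM.
# Namespace is an assumption.
oc describe vmi rhel82-vm0451 -n default | sed -n '/^Events:/,$p'
oc get events -n default --field-selector involvedObject.name=rhel82-vm0451
```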
Versions of all relevant components:
CNV 4.9.2
RHCS 5.0
OCP 4.9.15
CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather-failed-migration.tar.gz
- external trackers