-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
False
-
-
False
-
CLOSED
-
---
-
---
-
-
CNV I/U Operators Sprint 242
-
High
-
No
Created attachment 1986222 [details]
one virtlauncher pod log
Description of problem: During an EUS->EUS upgrade from 4.12 to 4.14 (brew.registry.redhat.io/rh-osbs/iib:566591), after CNV is upgraded to 4.14 and the workload update strategy is set to LiveMigrate, the automatic workload updates for all live-migratable VMs fail.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Pause the worker MCP
2. Turn off the workload update strategy
3. Upgrade to OCP 4.13
4. Upgrade CNV to 4.13, up to the latest 4.13 z-stream
5. Upgrade to OCP 4.14
6. Upgrade CNV to 4.14, up to the latest 4.14 z-stream
7. Set the workload update strategy to LiveMigrate
8. Unpause the worker MCP (example oc commands for steps 1, 7 and 8 are sketched after this list)
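For reference, a minimal sketch of the oc commands behind steps 1, 7 and 8; the HyperConverged CR name (kubevirt-hyperconverged) and namespace (openshift-cnv) are assumptions, not taken from this report:
================
# Step 1: pause the worker MachineConfigPool so node updates are held back
oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":true}}'

# Step 7: set LiveMigrate as the workload update method on the HyperConverged CR
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge \
  -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":["LiveMigrate"]}}}'

# Step 8: unpause the worker MachineConfigPool so the worker nodes drain and update
oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":false}}'
================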
Actual results:
After performing the above steps, at step 7 all of the VMIMs start failing:
================
NAMESPACE NAME PHASE VMI
test-upgrade-namespace kubevirt-workload-update-462pk Failed always-run-strategy-vm-1693420081-0430818
test-upgrade-namespace kubevirt-workload-update-gcx2q Failed always-run-strategy-vm-1693420081-0430818
test-upgrade-namespace kubevirt-workload-update-qvbqv Failed always-run-strategy-vm-1693420081-0430818
test-upgrade-namespace kubevirt-workload-update-st7kl PreparingTarget always-run-strategy-vm-1693420081-0430818
test-upgrade-namespace kubevirt-workload-update-zpxkv Failed always-run-strategy-vm-1693420081-0430818
test-upgrade-namespace kubevirt-workload-update-zrvsc Failed always-run-strategy-vm-1693420081-0430818
================
No successful vmim:
================
[cnv-qe-jenkins@cnv-qe-infra-01 eus]$ oc get vmim -A | grep -v Failed
NAMESPACE NAME PHASE VMI
kmp-enabled-for-upgrade kubevirt-workload-update-jv8wq Scheduling vm-upgrade-a-1693420859-1588397
kmp-enabled-for-upgrade kubevirt-workload-update-p4djk Pending vm-upgrade-b-1693420866-669033
test-upgrade-namespace kubevirt-evacuation-6gsch PreparingTarget vm-for-product-upgrade-nfs-1693419816-450729
test-upgrade-namespace kubevirt-workload-update-48g84 Scheduling vmb-macspoof-1693420728-3208427
test-upgrade-namespace kubevirt-workload-update-zflrn PreparingTarget manual-run-strategy-vm-1693420080-612376
[cnv-qe-jenkins@cnv-qe-infra-01 eus]$
=================
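To see why a given migration failed, the VMIM status and the VMI conditions can be inspected on the live cluster; a sketch, using names from the listings above:
================
oc get vmim -n test-upgrade-namespace kubevirt-workload-update-462pk -o yaml
oc describe vmi -n test-upgrade-namespace always-run-strategy-vm-1693420081-0430818
================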
Snippet from the virt-launcher pod log; the full log is attached.
================
panic: timed out waiting for domain to be defined
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 12 with status 512","pos":"virt-launcher-monitor.go:125","timestamp":"2023-08-31T01:27:32.893277Z"} {"component":"virt-launcher-monitor","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher-monitor.go:143","timestamp":"2023-08-31T01:27:32.893435Z"}================
There are many failed virt-launcher pods:
[cnv-qe-jenkins@cnv-qe-infra-01 eus]$ oc get pods -n test-upgrade-namespace | grep always
virt-launcher-always-run-strategy-vm-1693420081-0430818-bpk4s 0/1 Error 0 21m
virt-launcher-always-run-strategy-vm-1693420081-0430818-gjh69 0/1 Error 0 84m
virt-launcher-always-run-strategy-vm-1693420081-0430818-kmdpg 1/1 Running 0 5m32s
virt-launcher-always-run-strategy-vm-1693420081-0430818-kvh84 0/1 Error 0 89m
virt-launcher-always-run-strategy-vm-1693420081-0430818-qwjx7 0/1 Error 0 53m
virt-launcher-always-run-strategy-vm-1693420081-0430818-tnlkz 1/1 Running 0 7h13m
virt-launcher-always-run-strategy-vm-1693420081-0430818-vjppc 0/1 Error 0 94m
virt-launcher-always-run-strategy-vm-1693420081-0430818-wkbxx 0/1 Error 0 38m
================
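The log of any of the Error pods can be pulled for comparison with the attached one; a sketch (the compute container name is the KubeVirt default):
================
oc logs -n test-upgrade-namespace virt-launcher-always-run-strategy-vm-1693420081-0430818-bpk4s -c compute
================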
Note that there are two Running virt-launcher pods for the same VM.
The virt-controller log is flooded with these messages:
===========
==================
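For reference, the virt-controller log can be pulled from the live cluster with (assuming the default openshift-cnv namespace):
================
oc logs -n openshift-cnv deployment/virt-controller --since=1h
================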
On unpausing the worker MCP, the drain fails to evict these VMs from the node, so the worker nodes never finish updating. The machine-config-controller log shows these error messages:
==================
I0831 01:45:11.260097 1 drain_controller.go:350] Previous node drain found. Drain has been going on for 1.539504235401111 hours
E0831 01:45:11.260106 1 drain_controller.go:352] node cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0831 01:45:11.260120 1 drain_controller.go:173] node cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com: initiating drain
E0831 01:45:14.445756 1 drain_controller.go:144] WARNING: ignoring DaemonSet-managed Pods: cnv-tests-utilities/utility-8wtr5, nvidia-gpu-operator/nvidia-sandbox-validator-jjx7c, nvidia-gpu-operator/nvidia-vfio-manager-j4ll8, openshift-cluster-node-tuning-operator/tuned-8cb8s, openshift-cnv/bridge-marker-cmrpw, openshift-cnv/hostpath-provisioner-csi-r9lmh, openshift-cnv/kube-cni-linux-bridge-plugin-lfh2m, openshift-cnv/virt-handler-x4zd4, openshift-dns/dns-default-mhzfx, openshift-dns/node-resolver-d9ctn, openshift-image-registry/node-ca-dxj5d, openshift-ingress-canary/ingress-canary-gpg62, openshift-local-storage/diskmaker-manager-vszh7, openshift-machine-config-operator/machine-config-daemon-p4n8n, openshift-monitoring/node-exporter-mq2hg, openshift-multus/multus-89nzg, openshift-multus/multus-additional-cni-plugins-849q2, openshift-multus/network-metrics-daemon-zbgcj, openshift-network-diagnostics/network-check-target-vd9qt, openshift-nfd/nfd-worker-dqxqf, openshift-nmstate/nmstate-handler-tj5pr, openshift-operators/istio-cni-node-v2-3-kkqzr, openshift-ovn-kubernetes/ovnkube-node-lznhh, openshift-storage/csi-cephfsplugin-jnfmt, openshift-storage/csi-rbdplugin-mzml8
I0831 01:45:14.447178 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
E0831 01:45:14.478812 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0831 01:45:19.478943 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
E0831 01:45:19.493153 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0831 01:45:24.496967 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
E0831 01:45:24.523950 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0831 01:45:29.524635 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
E0831 01:45:29.589530 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0831 01:45:34.589574 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
E0831 01:45:34.616233 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0831 01:45:39.616498 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
E0831 01:45:39.633715 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
[cnv-qe-jenkins@cnv-qe-infra-01 eus]$
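The eviction failures point at the VMI PodDisruptionBudget; one way to inspect the PDBs and the blocked pod (not captured in this report):
================
oc get pdb -n test-upgrade-namespace
oc describe pod -n test-upgrade-namespace virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
================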
Expected results:
EUS upgrade completes successfully
Additional info:
Live cluster is available
Must gather can be found here: https://drive.google.com/drive/folders/1q4ipWMM2Z4jti9yJK_HCnFswDfEHkptV?usp=drive_link
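If a fresh CNV must-gather is needed from the live cluster, something like the following should work (the image tag here is an assumption, not taken from this report):
================
oc adm must-gather --image=registry.redhat.io/container-native-virtualization/cnv-must-gather-rhel9:v4.14.0
================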