OpenShift Virtualization / CNV-32690

[2236344] Unable to perform EUS to EUS upgrade between 4.12 and 4.14 with workloads


Details

    • Sprint: CNV I/U Operators Sprint 242
    • Priority: High

    Description

      Created attachment 1986222: one virt-launcher pod log


      Description of problem: During an EUS-to-EUS upgrade from 4.12 to 4.14 (brew.registry.redhat.io/rh-osbs/iib:566591), after CNV is upgraded to 4.14 and the workload update strategy is set to LiveMigrate, automatic workload updates fail for all live-migratable VMs.
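
      For reference, the workload update strategy mentioned above lives on the HyperConverged CR. A minimal sketch of enabling it, assuming the default resource name and namespace of a standard CNV install (kubevirt-hyperconverged in openshift-cnv) and the 4.x HyperConverged field layout:

      oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge \
        -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":["LiveMigrate"]}}}'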

      Version-Release number of selected component (if applicable):

      How reproducible:
      100%

      Steps to Reproduce:
      1. Pause the worker MCP (see the command sketch after this list)
      2. Turn off automatic workload updates
      3. Upgrade to OCP 4.13
      4. Upgrade to CNV 4.13, up to the latest z-stream
      5. Upgrade to OCP 4.14
      6. Upgrade to CNV 4.14, up to the latest z-stream
      7. Set the workload update strategy to LiveMigrate
      8. Unpause the worker MCP
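
      A minimal command sketch for steps 1, 2 and 8, assuming the default MCP name (worker) and the default HyperConverged name/namespace (kubevirt-hyperconverged in openshift-cnv); step 7 is the LiveMigrate patch shown under the problem description:

      # Step 1 / step 8: pause, and later unpause, the worker MachineConfigPool
      oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":true}}'
      oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":false}}'

      # Step 2: disable automatic workload updates by emptying the method list
      oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge \
        -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":[]}}}'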

      Actual results:
      After performing the above steps, at step 7 all VMIMs (VirtualMachineInstanceMigrations) are failing:
      ================
      test-upgrade-namespace kubevirt-workload-update-462pk Failed always-run-strategy-vm-1693420081-0430818
      test-upgrade-namespace kubevirt-workload-update-gcx2q Failed always-run-strategy-vm-1693420081-0430818
      test-upgrade-namespace kubevirt-workload-update-qvbqv Failed always-run-strategy-vm-1693420081-0430818
      test-upgrade-namespace kubevirt-workload-update-st7kl PreparingTarget always-run-strategy-vm-1693420081-0430818
      test-upgrade-namespace kubevirt-workload-update-zpxkv Failed always-run-strategy-vm-1693420081-0430818
      test-upgrade-namespace kubevirt-workload-update-zrvsc Failed always-run-strategy-vm-1693420081-0430818
      ================

      No successful VMIM:
      ================
      [cnv-qe-jenkins@cnv-qe-infra-01 eus]$ oc get vmim -A | grep -v Failed
      NAMESPACE NAME PHASE VMI
      kmp-enabled-for-upgrade kubevirt-workload-update-jv8wq Scheduling vm-upgrade-a-1693420859-1588397
      kmp-enabled-for-upgrade kubevirt-workload-update-p4djk Pending vm-upgrade-b-1693420866-669033
      test-upgrade-namespace kubevirt-evacuation-6gsch PreparingTarget vm-for-product-upgrade-nfs-1693419816-450729
      test-upgrade-namespace kubevirt-workload-update-48g84 Scheduling vmb-macspoof-1693420728-3208427
      test-upgrade-namespace kubevirt-workload-update-zflrn PreparingTarget manual-run-strategy-vm-1693420080-612376
      [cnv-qe-jenkins@cnv-qe-infra-01 eus]$
      =================
      Snippet from the virt-launcher pod log; the full log is attached.
      ================

      {"component":"virt-launcher","level":"info","msg":"Thread 32 (rpc-virtqemud) finished job remoteDispatchConnectListAllDomains with ret=0","pos":"virThreadJobClear:118","subcomponent":"libvirt","thread":"32","timestamp":"2023-08-31T01:27:23.889000Z"}

      panic: timed out waiting for domain to be defined

      {"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 12 with status 512","pos":"virt-launcher-monitor.go:125","timestamp":"2023-08-31T01:27:32.893277Z"} {"component":"virt-launcher-monitor","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher-monitor.go:143","timestamp":"2023-08-31T01:27:32.893435Z"}

      ================
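
      The attached launcher log can be collected with something like the following (the pod name is illustrative, taken from the listing below; the compute container carries the libvirt/qemu output shown in the snippet):

      oc logs -n test-upgrade-namespace virt-launcher-always-run-strategy-vm-1693420081-0430818-bpk4s -c compute --previous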
      Many failed virt-launcher pods are present:
      [cnv-qe-jenkins@cnv-qe-infra-01 eus]$ oc get pods -n test-upgrade-namespace | grep always
      virt-launcher-always-run-strategy-vm-1693420081-0430818-bpk4s 0/1 Error 0 21m
      virt-launcher-always-run-strategy-vm-1693420081-0430818-gjh69 0/1 Error 0 84m
      virt-launcher-always-run-strategy-vm-1693420081-0430818-kmdpg 1/1 Running 0 5m32s
      virt-launcher-always-run-strategy-vm-1693420081-0430818-kvh84 0/1 Error 0 89m
      virt-launcher-always-run-strategy-vm-1693420081-0430818-qwjx7 0/1 Error 0 53m
      virt-launcher-always-run-strategy-vm-1693420081-0430818-tnlkz 1/1 Running 0 7h13m
      virt-launcher-always-run-strategy-vm-1693420081-0430818-vjppc 0/1 Error 0 94m
      virt-launcher-always-run-strategy-vm-1693420081-0430818-wkbxx 0/1 Error 0 38m

      ================
      Note that there are two Running virt-launcher pods for the same VM.
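
      To correlate the extra Running pod with the stuck migration, the VMI's migration state can be inspected (a debugging sketch; the VMI name is taken from the listing above):

      # Shows source/target nodes, the target pod, and whether the migration completed or failed
      oc get vmi always-run-strategy-vm-1693420081-0430818 -n test-upgrade-namespace -o jsonpath='{.status.migrationState}'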

      The virt-controller log is flooded with these messages:
      ===========

      {"component":"virt-controller","kind":"","level":"error","msg":"failed to sync dynamic pod labels during sync: pods \"virt-launcher-always-run-strategy-vm-1693420081-0430818-tnlkz\" is forbidden: unable to validate against any security context constraint: [provider \"anyuid\": Forbidden: not usable by user or serviceaccount, provider \"pipelines-scc\": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.seLinuxOptions.level: Invalid value: \"\": must be s0:c28,c12, provider restricted-v2: .spec.securityContext.seLinuxOptions.type: Invalid value: \"virt_launcher.process\": must be , provider restricted-v2: .containers[0].runAsUser: Invalid value: 107: must be in the ranges: [1000780000, 1000789999], provider restricted-v2: .containers[0].seLinuxOptions.level: Invalid value: \"\": must be s0:c28,c12, provider restricted-v2: .containers[0].seLinuxOptions.type: Invalid value: \"virt_launcher.process\": must be , provider restricted-v2: .containers[0].capabilities.add: Invalid value: \"SYS_PTRACE\": capability may not be added, provider \"restricted\": Forbidden: not usable by user or serviceaccount, provider \"containerized-data-importer\": Forbidden: not usable by user or serviceaccount, provider \"nonroot-v2\": Forbidden: not usable by user or serviceaccount, provider \"nonroot\": Forbidden: not usable by user or serviceaccount, provider \"hostmount-anyuid\": Forbidden: not usable by user or serviceaccount, provider kubevirt-controller: .containers[0].capabilities.add: Invalid value: \"SYS_PTRACE\": capability may not be added, provider \"machine-api-termination-handler\": Forbidden: not usable by user or serviceaccount, provider \"bridge-marker\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork-v2\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork\": Forbidden: not usable by user or serviceaccount, provider \"hostaccess\": Forbidden: not usable by user or serviceaccount, provider \"nfd-worker\": Forbidden: not usable by user or serviceaccount, provider \"hostpath-provisioner-csi\": Forbidden: not usable by user or serviceaccount, provider \"linux-bridge\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-gpu-feature-discovery\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-mig-manager\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-node-status-exporter\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-operator-validator\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-sandbox-validator\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-vgpu-manager\": Forbidden: not usable by user or serviceaccount, provider \"ovs-cni-marker\": Forbidden: not usable by user or serviceaccount, provider \"kubevirt-handler\": Forbidden: not usable by user or serviceaccount, provider \"rook-ceph\": Forbidden: not usable by user or serviceaccount, provider \"node-exporter\": Forbidden: not usable by user or serviceaccount, provider \"rook-ceph-csi\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-dcgm\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-dcgm-exporter\": Forbidden: not usable by user or serviceaccount, provider \"privileged\": Forbidden: not usable by user or 
serviceaccount]","name":"virt-launcher-always-run-strategy-vm-1693420081-0430818-tnlkz","namespace":"test-upgrade-namespace","pos":"vmi.go:458","timestamp":"2023-08-31T01:27:41.786479Z","uid":"c17c47ec-f4f6-47bc-a627-057aef26042c"} {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance test-upgrade-namespace/always-run-strategy-vm-1693420081-0430818","pos":"vmi.go:322","reason":"error syncing labels to pod: pods \"virt-launcher-always-run-strategy-vm-1693420081-0430818-tnlkz\" is forbidden: unable to validate against any security context constraint: [provider \"anyuid\": Forbidden: not usable by user or serviceaccount, provider \"pipelines-scc\": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.seLinuxOptions.level: Invalid value: \"\": must be s0:c28,c12, provider restricted-v2: .spec.securityContext.seLinuxOptions.type: Invalid value: \"virt_launcher.process\": must be , provider restricted-v2: .containers[0].runAsUser: Invalid value: 107: must be in the ranges: [1000780000, 1000789999], provider restricted-v2: .containers[0].seLinuxOptions.level: Invalid value: \"\": must be s0:c28,c12, provider restricted-v2: .containers[0].seLinuxOptions.type: Invalid value: \"virt_launcher.process\": must be , provider restricted-v2: .containers[0].capabilities.add: Invalid value: \"SYS_PTRACE\": capability may not be added, provider \"restricted\": Forbidden: not usable by user or serviceaccount, provider \"containerized-data-importer\": Forbidden: not usable by user or serviceaccount, provider \"nonroot-v2\": Forbidden: not usable by user or serviceaccount, provider \"nonroot\": Forbidden: not usable by user or serviceaccount, provider \"hostmount-anyuid\": Forbidden: not usable by user or serviceaccount, provider kubevirt-controller: .containers[0].capabilities.add: Invalid value: \"SYS_PTRACE\": capability may not be added, provider \"machine-api-termination-handler\": Forbidden: not usable by user or serviceaccount, provider \"bridge-marker\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork-v2\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork\": Forbidden: not usable by user or serviceaccount, provider \"hostaccess\": Forbidden: not usable by user or serviceaccount, provider \"nfd-worker\": Forbidden: not usable by user or serviceaccount, provider \"hostpath-provisioner-csi\": Forbidden: not usable by user or serviceaccount, provider \"linux-bridge\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-gpu-feature-discovery\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-mig-manager\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-node-status-exporter\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-operator-validator\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-sandbox-validator\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-vgpu-manager\": Forbidden: not usable by user or serviceaccount, provider \"ovs-cni-marker\": Forbidden: not usable by user or serviceaccount, provider \"kubevirt-handler\": Forbidden: not usable by user or serviceaccount, provider \"rook-ceph\": Forbidden: not usable by user or serviceaccount, provider \"node-exporter\": Forbidden: not usable by user or serviceaccount, provider \"rook-ceph-csi\": Forbidden: not usable by user or serviceaccount, provider \"nvidia-dcgm\": Forbidden: not usable by user or serviceaccount, 
provider \"nvidia-dcgm-exporter\": Forbidden: not usable by user or serviceaccount, provider \"privileged\": Forbidden: not usable by user or serviceaccount]","timestamp":"2023-08-31T01:27:41.786552Z"} {"component":"virt-controller","kind":"","level":"info","msg":"Marked Migration test-upgrade-namespace/kubevirt-workload-update-gcx2q failed on vmi due to target pod disappearing before migration kicked off.","name":"always-run-strategy-vm-1693420081-0430818","namespace":"test-upgrade-namespace","pos":"migration.go:827","timestamp":"2023-08-31T01:27:43.846173Z","uid":"ace4d971-9e13-4229-b322-d529e3216f95"}

      ==================
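      To understand why the restricted-v2 validation above fails for the pre-upgrade launcher pod, it can help to check which SCC the pod was originally admitted under and the namespace's SELinux MCS range (a debugging sketch; the pod name is taken from the log above):

      # SCC the running pod was admitted under (recorded in the admission annotation)
      oc get pod virt-launcher-always-run-strategy-vm-1693420081-0430818-tnlkz -n test-upgrade-namespace -o jsonpath='{.metadata.annotations.openshift\.io/scc}'
      # Namespace MCS labels that restricted-v2 validates against (s0:c28,c12 in the log)
      oc get namespace test-upgrade-namespace -o jsonpath='{.metadata.annotations.openshift\.io/sa\.scc\.mcs}'
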
      After unpausing the worker MCP, the drain fails to evict these VMs from the node, so the worker nodes never finish updating. These errors appear in the machine-config-controller log:
      ==================
      I0831 01:45:11.260097 1 drain_controller.go:350] Previous node drain found. Drain has been going on for 1.539504235401111 hours
      E0831 01:45:11.260106 1 drain_controller.go:352] node cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0831 01:45:11.260120 1 drain_controller.go:173] node cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com: initiating drain
      E0831 01:45:14.445756 1 drain_controller.go:144] WARNING: ignoring DaemonSet-managed Pods: cnv-tests-utilities/utility-8wtr5, nvidia-gpu-operator/nvidia-sandbox-validator-jjx7c, nvidia-gpu-operator/nvidia-vfio-manager-j4ll8, openshift-cluster-node-tuning-operator/tuned-8cb8s, openshift-cnv/bridge-marker-cmrpw, openshift-cnv/hostpath-provisioner-csi-r9lmh, openshift-cnv/kube-cni-linux-bridge-plugin-lfh2m, openshift-cnv/virt-handler-x4zd4, openshift-dns/dns-default-mhzfx, openshift-dns/node-resolver-d9ctn, openshift-image-registry/node-ca-dxj5d, openshift-ingress-canary/ingress-canary-gpg62, openshift-local-storage/diskmaker-manager-vszh7, openshift-machine-config-operator/machine-config-daemon-p4n8n, openshift-monitoring/node-exporter-mq2hg, openshift-multus/multus-89nzg, openshift-multus/multus-additional-cni-plugins-849q2, openshift-multus/network-metrics-daemon-zbgcj, openshift-network-diagnostics/network-check-target-vd9qt, openshift-nfd/nfd-worker-dqxqf, openshift-nmstate/nmstate-handler-tj5pr, openshift-operators/istio-cni-node-v2-3-kkqzr, openshift-ovn-kubernetes/ovnkube-node-lznhh, openshift-storage/csi-cephfsplugin-jnfmt, openshift-storage/csi-rbdplugin-mzml8
      I0831 01:45:14.447178 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
      E0831 01:45:14.478812 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0831 01:45:19.478943 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
      E0831 01:45:19.493153 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0831 01:45:24.496967 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
      E0831 01:45:24.523950 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0831 01:45:29.524635 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
      E0831 01:45:29.589530 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0831 01:45:34.589574 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
      E0831 01:45:34.616233 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0831 01:45:39.616498 1 drain_controller.go:144] evicting pod test-upgrade-namespace/virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv
      E0831 01:45:39.633715 1 drain_controller.go:144] error when evicting pods/"virt-launcher-vm-for-product-upgrade-nfs-1693419816-450729t5ppv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      [cnv-qe-jenkins@cnv-qe-infra-01 eus]$
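
      The eviction failures come from the PodDisruptionBudget that KubeVirt maintains for each live-migratable VMI; presumably, because the migration target keeps failing, the budget never frees the source pod for eviction. A quick way to inspect the budgets (a debugging sketch):

      # List the disruption budgets in the affected namespace
      oc get pdb -n test-upgrade-namespace
      # Show allowed disruptions and selected pods for a given budget (name illustrative)
      oc describe pdb <pdb-name> -n test-upgrade-namespace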

      Expected results:
      The EUS-to-EUS upgrade completes successfully.

      Additional info:
      A live cluster is available.
      The must-gather can be found here: https://drive.google.com/drive/folders/1q4ipWMM2Z4jti9yJK_HCnFswDfEHkptV?usp=drive_link

          People

            Assignee: Luboslav Pivarc (lpivarc)
            Reporter: Debarati Basu-Nag (rhn-support-dbasunag)