  1. OpenShift Virtualization
  2. CNV-38193

Hotplugged VM cannot be evicted, which blocks the OCP upgrade


    • Bug
    • Resolution: Done-Errata
    • Blocker
    • CNV v4.15.0
    • CNV v4.15.0
    • CNV Storage
    • None
    • 0.42
    • True
    • OCP upgrade is blocked because the hotplugged VM cannot be evicted
    • False
    • ---
    • ---
    • Storage Core Sprint 249
    • No

      Description of problem:

      The OCP upgrade is blocked because a VM with a hotplugged volume cannot be evicted while its node is being drained.
      

      Version-Release number of selected component (if applicable):

      OCP upgrade from 4.14.3 to 4.15.0
      

      How reproducible:

      100% 
      

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get nodes
      NAME                                             STATUS                     ROLES                  AGE     VERSION
      cnv-qe-infra-29.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      control-plane,master   6h6m    v1.28.6+0fb4726
      cnv-qe-infra-30.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      control-plane,master   6h8m    v1.28.6+0fb4726
      cnv-qe-infra-31.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      control-plane,master   6h8m    v1.28.6+0fb4726
      cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      worker                 4h53m   v1.28.6+0fb4726
      cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com   Ready,SchedulingDisabled   worker                 4h51m   v1.27.10+28ed2d7
      cnv-qe-infra-34.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      worker                 4h53m   v1.28.6+0fb4726
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 
      
      PDB:
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        creationTimestamp: "2024-02-08T22:52:42Z"
        generateName: kubevirt-disruption-budget-
        generation: 25
        name: kubevirt-disruption-budget-qdcr9
        namespace: test-upgrade-namespace
        ownerReferences:
        - apiVersion: kubevirt.io/v1
          blockOwnerDeletion: true
          controller: true
          kind: VirtualMachineInstance
          name: fedora-hotplug-upg-1707432657-9790156
          uid: acb3931d-c57f-40d8-bfba-5e44e82ae8ab
        resourceVersion: "323478"
        uid: 88b6ae22-39fe-4da8-8d4a-52fcb33d98a6
      spec:
        minAvailable: 1
        selector:
          matchLabels:
            kubevirt.io/created-by: acb3931d-c57f-40d8-bfba-5e44e82ae8ab
      status:
        conditions:
        - lastTransitionTime: "2024-02-09T01:23:25Z"
          message: ""
          observedGeneration: 25
          reason: InsufficientPods
          status: "False"
          type: DisruptionAllowed
        currentHealthy: 1
        desiredHealthy: 1
        disruptionsAllowed: 0
        expectedPods: 13
        observedGeneration: 25
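
The PDB status is consistent with the eviction refusals seen in the drain log. Below is a minimal sketch of the arithmetic, assuming the usual rule that the number of allowed disruptions is the count of healthy pods beyond the desired minimum, clamped at zero. The class and function names are illustrative only and are not part of any Kubernetes or KubeVirt API.

```python
from dataclasses import dataclass


@dataclass
class PdbStatus:
    """Subset of the PodDisruptionBudget status fields quoted in this report."""
    current_healthy: int
    desired_healthy: int
    expected_pods: int


def disruptions_allowed(status: PdbStatus) -> int:
    """Healthy pods above the required minimum, never negative (assumed rule)."""
    return max(0, status.current_healthy - status.desired_healthy)


def eviction_permitted(status: PdbStatus) -> bool:
    """An eviction is admitted only while at least one disruption is allowed."""
    return disruptions_allowed(status) > 0


# Values copied from the PDB status in this report.
reported = PdbStatus(current_healthy=1, desired_healthy=1, expected_pods=13)
print(disruptions_allowed(reported))
print(eviction_permitted(reported))
```

Under that assumption, the single healthy launcher pod exactly meets `minAvailable: 1`, so every eviction attempt is rejected until a second healthy pod matching the selector exists.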
      

      VM:

      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get vm fedora-hotplug-upg-1707432657-9790156  -n test-upgrade-namespace -o yaml
      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      metadata:
        annotations:
          kubemacpool.io/transaction-timestamp: "2024-02-08T22:52:23.534961582Z"
          kubevirt.io/latest-observed-api-version: v1
          kubevirt.io/storage-observed-api-version: v1
        creationTimestamp: "2024-02-08T22:50:58Z"
        finalizers:
        - kubevirt.io/virtualMachineControllerFinalize
        generation: 3
        labels:
          created-by-dynamic-class-creator: "Yes"
          kubevirt.io/vm: fedora-hotplug-upg
        name: fedora-hotplug-upg-1707432657-9790156
        namespace: test-upgrade-namespace
        resourceVersion: "322703"
        uid: 12b277a7-5612-45df-8ee5-3c5c45be1a22
      spec:
        running: true
        template:
          metadata:
            creationTimestamp: null
            labels:
              debugLogs: "true"
              kubevirt.io/domain: fedora-hotplug-upg-1707432657-9790156
              kubevirt.io/vm: fedora-hotplug-upg-1707432657-9790156
          spec:
            architecture: amd64
            domain:
              cpu:
                cores: 1
              devices:
                disks:
                - disk:
                    bus: virtio
                  name: containerdisk
                - disk:
                    bus: virtio
                  name: cloudinitdisk
                - disk:
                    bus: scsi
                  name: blank-dv
                  serial: "1234567890"
                interfaces:
                - macAddress: 02:f5:7c:00:00:04
                  masquerade: {}
                  name: default
                rng: {}
              machine:
                type: pc-q35-rhel9.2.0
              resources:
                requests:
                  memory: 1Gi
            networks:
            - name: default
              pod: {}
            terminationGracePeriodSeconds: 30
            volumes:
            - containerDisk:
                image: quay.io/openshift-cnv/qe-cnv-tests-fedora:38@sha256:d0658b20dc8474caedd061f02ea4e5c3c35922a472f0ec141c264005291be2f3
              name: containerdisk
            - cloudInitNoCloud:
                userData: |-
                  #cloud-config
                  chpasswd:
                    expire: false
                  password: password
                  user: fedora
                  ssh_pwauth: true
      
                  ssh_authorized_keys:
                   [ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCj47ubVnxR16JU7ZfDli3N5QVBAwJBRh2xMryyjk5dtfugo5JIPGB2cyXTqEDdzuRmI+Vkb/A5duJyBRlA+9RndGGmhhMnj8and3wu5/cEb7DkF6ZJ25QV4LQx3K/i57LStUHXRTvruHOZ2nCuVXWqi7wSvz5YcvEv7O8pNF5uGmqHlShBdxQxcjurXACZ1YY0YDJDr3AJai1KF9zehVJODuSbrnOYpThVWGjFuFAnNxbtuZ8EOSougN2aYTf2qr/KFGDHtewIkzZmP6cjzKO5bN3pVbXxmb2Gces/BYHntY4MXBTUqwsmsCRC5SAz14bEP/vsLtrNhjq9vCS+BjMT root@exec1.rdocloud]
                  runcmd: ['grep ssh-rsa /etc/crypto-policies/back-ends/opensshserver.config || sudo update-crypto-policies --set LEGACY || true', "sudo sed -i 's/^#\\?PasswordAuthentication no/PasswordAuthentication yes/g' /etc/ssh/sshd_config", 'sudo systemctl enable sshd', 'sudo systemctl restart sshd']
              name: cloudinitdisk
            - dataVolume:
                hotpluggable: true
                name: blank-dv
              name: blank-dv
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2024-02-08T22:53:09Z"
          status: "True"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: null
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: null
          status: "True"
          type: LiveMigratable
        - lastProbeTime: "2024-02-08T22:53:50Z"
          lastTransitionTime: null
          status: "True"
          type: AgentConnected
        created: true
        desiredGeneration: 3
        observedGeneration: 3
        printableStatus: Running
        ready: true
        volumeSnapshotStatuses:
        - enabled: false
          name: containerdisk
          reason: Snapshot is not supported for this volumeSource type [containerdisk]
        - enabled: false
          name: cloudinitdisk
          reason: Snapshot is not supported for this volumeSource type [cloudinitdisk]
        - enabled: true
          name: blank-dv
      

      The following was found in the machine-config-controller log:

      E0209 01:32:34.367292       1 render_controller.go:439] Error syncing Generated MCFG: %!w(*errors.StatusError=&{{{ } {   <nil>} Failure Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again Conflict 0xc0041148a0 409}})
      E0209 01:32:34.391824       1 render_controller.go:461] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
      I0209 01:32:34.391846       1 render_controller.go:378] Error syncing machineconfigpool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
      I0209 01:32:38.771452       1 drain_controller.go:152] evicting pod test-upgrade-namespace/virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv
      E0209 01:32:38.788727       1 drain_controller.go:152] error when evicting pods/"virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0209 01:32:39.258979       1 node_controller.go:1035] No nodes available for updates
      I0209 01:32:39.259123       1 status.go:224] Degraded Machine: cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com and Degraded Reason: failed to drain node: cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com after 1 hour. Please see machine-config-controller logs for more information
      I0209 01:32:43.789715       1 drain_controller.go:152] evicting pod test-upgrade-namespace/virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv
      E0209 01:32:43.804143       1 drain_controller.go:152] error when evicting pods/"virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0209 01:32:44.291672       1 node_controller.go:1035] No nodes available for updates
      I0209 01:32:44.291877       1 status.go:224] Degraded Machine: cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com and Degraded Reason: failed to drain node: cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com after 1 hour. Please see machine-config-controller logs for more information
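
For triage it can help to count how often each pod's eviction was refused. The following is a rough sketch that assumes the log line layout quoted in this report; the regular expression and the function name are hypothetical helpers, not part of the machine-config-controller.

```python
import re
from collections import Counter

# Matches the drain_controller eviction failures as they appear in this report.
EVICT_ERROR = re.compile(
    r'^E\d{4} (?P<time>\d{2}:\d{2}:\d{2})\.\d+\s+\d+ drain_controller\.go:\d+\] '
    r'error when evicting pods/"(?P<pod>[^"]+)" -n "(?P<ns>[^"]+)"'
)


def count_eviction_failures(log_text: str) -> Counter:
    """Count refused eviction attempts per (namespace, pod)."""
    failures = Counter()
    for line in log_text.splitlines():
        match = EVICT_ERROR.match(line.strip())
        if match:
            failures[(match["ns"], match["pod"])] += 1
    return failures


# Three lines taken from the log in this report.
sample = '''
I0209 01:32:38.771452       1 drain_controller.go:152] evicting pod test-upgrade-namespace/virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv
E0209 01:32:38.788727       1 drain_controller.go:152] error when evicting pods/"virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0209 01:32:43.804143       1 drain_controller.go:152] error when evicting pods/"virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv" -n "test-upgrade-namespace" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
'''

for (namespace, pod), attempts in count_eviction_failures(sample).items():
    print(f"{namespace}/{pod}: {attempts} refused eviction attempts")
```

In the excerpt only one pod is affected: the running launcher pod `virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv`, which is retried every 5 seconds.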
      

      Most of the virt-launcher pods for the hotplug VM are in Error state:

      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get pods -n test-upgrade-namespace 
      NAME                                                              READY   STATUS      RESTARTS   AGE
      hp-volume-d82rr                                                   0/1     Pending     0          62m
      virt-launcher-always-run-strategy-vm-1707432374-0880806-9rf26     1/1     Running     0          62m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-2rb5v         0/2     Error       0          62m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-49q59         0/2     Error       0          42m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-5h9tm         0/2     Error       0          56m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-7hmmx         0/2     Error       0          7m21s
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-k2dsr         0/2     Error       0          48m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-mxvd2         0/2     Error       0          59m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv         2/2     Running     0          162m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-nj528         0/2     Error       0          58m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-nzv7c         0/2     Error       0          19m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-qdj6p         0/2     Completed   0          24m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-v5bbs         0/2     Error       0          36m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-w4gtm         0/2     Completed   0          30m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-w6dpd         0/2     Completed   0          13m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-x9w9v         0/2     Error       0          53m
      virt-launcher-fedora-hotplug-upg-1707432657-9790156-zrhbg         0/2     Error       0          95s
      virt-launcher-manual-run-strategy-vm-1707432373-755645-f8gm8      1/1     Running     0          59m
      virt-launcher-vm-bridge-connected-1707433221-2068923-8cgff        2/2     Running     0          62m
      virt-launcher-vm-for-product-upgrade-ocs-1707432301-53998376q95   1/1     Running     0          85m
      virt-launcher-vm-snapshot-upgrade-a-1707432958-4519243-5bc6p      1/1     Running     0          85m
      virt-launcher-vma-macspoof-1707433341-3924508-6dptd               2/2     Running     0          106m
      virt-launcher-vmb-macspoof-1707433342-277827-pxgs2                2/2     Running     0          106m
      virt-launcher-windows-vm-1707432617-3495035-gdlnw                 1/1     Running     0          85m
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 
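
The listing shows 15 launcher pods for `fedora-hotplug-upg-1707432657-9790156`: 11 in Error, 3 Completed, and 1 Running. A small sketch for producing that tally from `oc get pods` text output follows; it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE), and the function name is a hypothetical helper.

```python
from collections import Counter


def tally_launcher_status(pods_output: str, vm_name: str) -> Counter:
    """Count STATUS values for the virt-launcher pods of one VM."""
    prefix = f"virt-launcher-{vm_name}-"
    tally = Counter()
    for line in pods_output.splitlines():
        fields = line.split()
        # Assumed columns: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 3 and fields[0].startswith(prefix):
            tally[fields[2]] += 1
    return tally


# A few rows copied from the listing in this report.
sample_pods = """
NAME                                                              READY   STATUS      RESTARTS   AGE
hp-volume-d82rr                                                   0/1     Pending     0          62m
virt-launcher-always-run-strategy-vm-1707432374-0880806-9rf26     1/1     Running     0          62m
virt-launcher-fedora-hotplug-upg-1707432657-9790156-2rb5v         0/2     Error       0          62m
virt-launcher-fedora-hotplug-upg-1707432657-9790156-49q59         0/2     Error       0          42m
virt-launcher-fedora-hotplug-upg-1707432657-9790156-nctjv         2/2     Running     0          162m
virt-launcher-fedora-hotplug-upg-1707432657-9790156-qdj6p         0/2     Completed   0          24m
"""

tally = tally_launcher_status(sample_pods, "fedora-hotplug-upg-1707432657-9790156")
for status, count in sorted(tally.items()):
    print(f"{status}: {count}")
```

Pods of other VMs and the pending `hp-volume-d82rr` pod are skipped because their names do not start with the launcher prefix for this VM.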
      

      Expected results:

      OCP upgrade should continue successfully
      

      Additional info:

      The cluster is available for triage. A must-gather will be attached.
      

            akalenyu Alex Kalenyuk
            rhn-support-dbasunag Debarati Basu-Nag
            Jenia Peimer Jenia Peimer