OCPBUGS-44623 (OpenShift Bugs)

VolumeAttachment does not reconcile on worker VM reboot

    • Critical

      Fix volume attachment not being cleaned up on VM node reboot in kubevirt-csi-driver

      This is a clone of issue OCPBUGS-44350. The following is the description of the original issue:

      Description of problem:

      When a kubevirt-csi pod runs on a worker node of a Guest cluster, the underlying PVC from the infra/host cluster is attached to the Virtual Machine that is the worker node of the Guest cluster.
      
      That works well, but only until the VM is rebooted.
      
      After the VM is power cycled for any reason, the VolumeAttachment on the Guest cluster still exists and still reports ATTACHED as true.
      
      [guest cluster]# oc get volumeattachment
      NAME                                                                   ATTACHER          PV                                         NODE                         ATTACHED   AGE
      csi-976b6b166ef7ea378de9a350c9ef427c23e8c072dc6e76a392241d273c3effdb   csi.kubevirt.io   pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b   hostedcluster2-rlq9m-z2x88   true       39m
      
      But the VM no longer has the hotplugged disk (it's not a persistent hotplug). It's not attached at all.
      
      It only has its rhcos disk and cloud-init after the reboot:
      
      [host cluster]# oc get vmi -n clusters-hostedcluster2 hostedcluster2-rlq9m-z2x88 -o yaml | yq '.status.volumeStatus'
      - name: cloudinitvolume
        size: 1048576
        target: vdb
      - name: rhcos
        persistentVolumeClaimInfo:
          accessModes:
            - ReadWriteOnce
          capacity:
            storage: 32Gi
          claimName: hostedcluster2-rlq9m-z2x88-rhcos
          filesystemOverhead: "0"
          requests:
            storage: "34359738368"
          volumeMode: Block
        target: vda
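
The mismatch between the two outputs above can be checked programmatically. The following is only a minimal sketch, not code from kubevirt-csi-driver: the function name `stale_attachments` and the sample data are hypothetical, and it assumes that a hotplugged disk shows up in the VMI's `.status.volumeStatus` under a name equal to the guest PV name. That naming should be verified on a real cluster before relying on it. The inputs are the parsed JSON of `oc get volumeattachment -o json` (guest cluster) and the VMI's `.status.volumeStatus` (host cluster).

```python
from typing import Any, Dict, List

ATTACHER = "csi.kubevirt.io"  # attacher name shown in the output above


def stale_attachments(
    volume_attachments: List[Dict[str, Any]],
    vmi_volume_status: List[Dict[str, Any]],
    node_name: str,
) -> List[str]:
    """Return PV names that the guest cluster reports as attached to
    node_name but that are missing from the VMI's volumeStatus.

    ASSUMPTION: the hotplugged volume is listed in volumeStatus under a
    name equal to the PV name. This is not confirmed by the report.
    """
    present = {vol.get("name") for vol in vmi_volume_status}
    stale = []
    for va in volume_attachments:
        spec = va.get("spec", {})
        status = va.get("status", {})
        if spec.get("attacher") != ATTACHER:
            continue  # attachment handled by a different CSI driver
        if spec.get("nodeName") != node_name:
            continue  # attachment belongs to another worker node
        if not status.get("attached", False):
            continue  # not reported as attached, so nothing is stale
        pv = spec.get("source", {}).get("persistentVolumeName")
        if pv and pv not in present:
            stale.append(pv)
    return stale


# Sample data mirroring the outputs in this report (trimmed to the
# fields the function reads).
EXAMPLE_ATTACHMENTS = [
    {
        "spec": {
            "attacher": "csi.kubevirt.io",
            "nodeName": "hostedcluster2-rlq9m-z2x88",
            "source": {
                "persistentVolumeName": "pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b"
            },
        },
        "status": {"attached": True},
    }
]
EXAMPLE_VOLUME_STATUS = [
    {"name": "cloudinitvolume", "target": "vdb"},
    {"name": "rhcos", "target": "vda"},
]

if __name__ == "__main__":
    print(
        stale_attachments(
            EXAMPLE_ATTACHMENTS,
            EXAMPLE_VOLUME_STATUS,
            "hostedcluster2-rlq9m-z2x88",
        )
    )
```

With the sample data, the function reports the PV from the VolumeAttachment as stale, because the VMI lists only `cloudinitvolume` and `rhcos`.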
      
      As a result, all workloads with PVCs fail to start, because the hotplug is not triggered again. The worker node VM cannot find the disk:
      
      26s         Warning   FailedMount                                  pod/mypod                             MountVolume.MountDevice failed for volume "pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b" : rpc error: code = Unknown desc = couldn't find device by serial id
      
      So workload pods cannot start.

      Version-Release number of selected component (if applicable):

          OCP 4.17.3
          CNV 4.17.0
          MCE 2.7.0

      How reproducible:

          Always

      Steps to Reproduce:

          1. Have a pod running with a PV from kubevirt-csi in the guest cluster
          2. Shut down the worker VM running the pod and start it again
          

      Actual results:

          Workloads fail to start after VM reboot

      Expected results:

          The disk is hotplugged again after the reboot and workloads start

      Additional info:

          

              rhn-engineering-dvossel David Vossel
              openshift-crt-jira-prow OpenShift Prow Bot
              Liangquan Li Liangquan Li