Migration Toolkit for Virtualization / MTV-1343

forklift-controller OOM during 100 VM/20+ plan migration from VMware

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: Controller

      We are demo.redhat.com, proving out MTV mass migration for customer workshops.

      Our OCP+Virt clusters run as guest VMs on OCP+Virt deployed on IBM bare metal.

      We are using mostly external Ceph storage.

      We have clusters available that show similar failure conditions and will provide access credentials when this issue is picked up.

      Background from another cluster's attempt:

      MTV deployment of 40 users (40 plans, 160 VMs) began to fail.

      • The forklift-controller pod hits OOM errors and ends up in CrashLoopBackOff.
      • forklift-controller has two containers: main and inventory.
        • main is failing with OOM
        • inventory is OK
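
For triage, the container being OOM-killed can be picked out of the pod status. A minimal sketch, assuming the pod JSON was saved from `oc get pod <forklift-controller-pod> -o json`; the sample dict and the function name `oom_killed_containers` are illustrative and not taken from the affected cluster:

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or last termination reason is OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names


# Hand-written sample shaped like the failure described above (not real cluster output).
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "main",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "inventory", "state": {"running": {}}, "lastState": {}},
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

With the sample above, only the main container is reported, matching what we see on the cluster.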

      We are also having storage performance issues.  Here is what User 30 experienced.

      The forklift-volume-populator-controller pod complained that pvc 'pvc/vmexamples-user30/prime-992a2ced-f6cb-4e67-a445-d806f20656e5' in work queue no longer exists

      PVC no longer exists?  Why?
      PVC plan-user30-vm-54068-r9kbn Error saving claim: Operation cannot be fulfilled on persistentvolumeclaims "plan-user30-vm-54068-r9kbn": the object has been modified; please apply your changes to the latest version and try again
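
The "object has been modified" message is the API server's optimistic-concurrency conflict: the update carried a stale resourceVersion. Clients usually handle it by re-reading the object and retrying. A toy sketch of that pattern follows; `FakeStore`, `Conflict`, and `update_with_retry` are hypothetical stand-ins and not the populator's or CDI's actual code:

```python
class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for one API object guarded by a resourceVersion."""

    def __init__(self):
        self.obj = {"resourceVersion": 1, "annotations": {}}

    def get(self) -> dict:
        return {"resourceVersion": self.obj["resourceVersion"],
                "annotations": dict(self.obj["annotations"])}

    def update(self, new: dict) -> None:
        if new["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        self.obj = {"resourceVersion": new["resourceVersion"] + 1,
                    "annotations": dict(new["annotations"])}


def update_with_retry(store, mutate, attempts: int = 5) -> int:
    """Re-read and re-apply `mutate` until the write lands; return tries used."""
    for attempt in range(1, attempts + 1):
        obj = store.get()      # always start from the latest version
        mutate(obj)
        try:
            store.update(obj)
            return attempt
        except Conflict:
            continue           # someone else wrote first; read again
    raise Conflict(f"still conflicting after {attempts} attempts")


store = FakeStore()
state = {"interfered": False}


def set_progress(obj):
    if not state["interfered"]:
        # Simulate another controller writing between our read and our write.
        other = store.get()
        other["annotations"]["owner"] = "other-controller"
        store.update(other)
        state["interfered"] = True
    obj["annotations"]["progress"] = "100.0%"


tries = update_with_retry(store, set_progress)
print(tries, store.obj["annotations"])
```

A single conflict like this is normally harmless. It becomes interesting here only if the retries never succeed or the PVC really was deleted underneath the writer.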
      I found the following in Alerts
      When running VMs using ODF storage with 'rbd' mounter or 'rbd.csi.ceph.com provisioner', VMs may cause reports of bad crc/signature errors due to certain I/O patterns. Cluster performance can be severely degraded if the number of re-transmissions due to crc errors causes network saturation.

      100% of the rook-ceph-mgr-external/rook-ceph-mgr-external targets in the openshift-storage namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
      The csi-rbdplugin-provisioner is disavowing all knowledge:

      and then in csi-addons:
      I0814 13:56:03.470300 1 leaderelection.go:354] lock is held by csi-rbdplugin-provisioner-86679bf8fc-smnv6 and has not yet expired
      I0814 13:56:03.470331 1 leaderelection.go:255] failed to acquire lease openshift-storage/openshift-storage-rbd-csi-ceph-com-csi-addons
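
These lines use the klog header layout (severity letter, month and day, time, thread ID, file:line, then the message). A small sketch for splitting such entries when combing through the provisioner logs; the regex and the `parse_klog` name are my own, and it assumes one entry per line:

```python
import re

# Severity, MMDD, HH:MM:SS.micros, thread ID, source file:line, message.
KLOG = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<tid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_klog(line: str):
    """Return the header fields and message as a dict, or None if the line does not match."""
    m = KLOG.match(line.strip())
    return m.groupdict() if m else None


entry = parse_klog(
    "I0814 13:56:03.470331 1 leaderelection.go:255] failed to acquire lease "
    "openshift-storage/openshift-storage-rbd-csi-ceph-com-csi-addons"
)
print(entry["sev"], entry["file"], entry["line"], entry["msg"])
```

Both entries above are severity I (info), so the lease message by itself only says this replica is not the current leader.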
      csi-rbdplugin-provisioner-86679bf8fc-q8xtv:
      I0814 13:55:14.893726 1 controllerserver-default.go:42] ID: 21 Using default ControllerGetCapabilities
      I0814 13:55:14.893856 1 utils.go:205] ID: 21 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":7}}},{"Type":{"Rpc":{"type":9}}}]}
      I0814 13:55:44.624873 1 utils.go:198] ID: 22 GRPC call: /identity.Identity/GetCapabilities
      I0814 13:55:44.624910 1 utils.go:199] ID: 22 GRPC request: {}
      I0814 13:55:44.625032 1 utils.go:205] ID: 22 GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}},{"Type":{"ReclaimSpace":{"type":1}}},{"Type":{"NetworkFence":{"type":1}}},{"Type":{"VolumeReplication":{"type":1}}}]}
      I0814 14:01:50.291479 1 utils.go:198] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 GRPC call: /csi.v1.Controller/CreateVolume
      I0814 14:01:50.291670 1 utils.go:199] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 GRPC request: {"capacity_range":{"required_bytes":22548578304},"name":"pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23","csi.storage.k8s.io/pvc/name":"prime-992a2ced-f6cb-4e67-a445-d806f20656e5","csi.storage.k8s.io/pvc/namespace":"vmexamples-user30","imageFeatures":"layering,deep-flatten,exclusive-lock,object-map,fast-diff","imageFormat":"2","pool":"ocpv01-tenants","volumeNamePrefix":"ocp4-cluster-vcch9-"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":5}}]}
      I0814 14:01:50.291910 1 rbd_util.go:1317] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 setting disableInUseChecks: true image features: [object-map layering fast-diff deep-flatten exclusive-lock] mounter: rbd
      I0814 14:01:50.331000 1 omap.go:89] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 got omap values: (pool="ocpv01-tenants", namespace="", name="csi.volumes.default"): map[]
      I0814 14:01:50.334528 1 omap.go:159] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 set omap keys (pool="ocpv01-tenants", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23:8588445b-d9fc-4e74-81e6-2894d583345e])
      I0814 14:01:50.335230 1 omap.go:159] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 set omap keys (pool="ocpv01-tenants", namespace="", name="csi.volume.8588445b-d9fc-4e74-81e6-2894d583345e"): map[csi.imagename:ocp4-cluster-vcch9-8588445b-d9fc-4e74-81e6-2894d583345e csi.volname:pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 csi.volume.owner:vmexamples-user30])
      I0814 14:01:50.335251 1 rbd_journal.go:491] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 generated Volume ID (0001-0011-openshift-storage-0000000000000017-8588445b-d9fc-4e74-81e6-2894d583345e) and image name (ocp4-cluster-vcch9-8588445b-d9fc-4e74-81e6-2894d583345e) for request name (pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23)
      I0814 14:01:50.335294 1 rbd_util.go:425] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 rbd: create ocpv01-tenants/ocp4-cluster-vcch9-8588445b-d9fc-4e74-81e6-2894d583345e size 21504M (features: [exclusive-lock fast-diff deep-flatten layering object-map]) using mon 10.38.85.193:6789,10.38.85.191:6789,10.38.85.192:6789
      I0814 14:01:50.335337 1 rbd_util.go:1565] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 setting image options on ocpv01-tenants/ocp4-cluster-vcch9-8588445b-d9fc-4e74-81e6-2894d583345e
      I0814 14:01:50.356525 1 controllerserver.go:763] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 created image ocpv01-tenants/ocp4-cluster-vcch9-8588445b-d9fc-4e74-81e6-2894d583345e backed for request name pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23
      I0814 14:01:50.367413 1 omap.go:159] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 set omap keys (pool="ocpv01-tenants", namespace="", name="csi.volume.8588445b-d9fc-4e74-81e6-2894d583345e"): map[csi.imageid:719a0fee6be261])
      I0814 14:01:50.367542 1 utils.go:205] ID: 23 Req-ID: pvc-762c7e55-3d70-4c48-bac9-82f7eb834f23 GRPC response: {"volume":{"capacity_bytes":22548578304,"volume_context":{"clusterID":"openshift-storage","imageFeatures":"layering,deep-flatten,exclusive-lock,object-map,fast-diff","imageFormat":"2","imageName":"ocp4-cluster-vcch9-8588445b-d9fc-4e74-81e6-2894d583345e","journalPool":"ocpv01-tenants","pool":"ocpv01-tenants","volumeNamePrefix":"ocp4-cluster-vcch9-"},"volume_id":"0001-0011-openshift-storage-0000000000000017-8588445b-d9fc-4e74-81e6-2894d583345e"}}
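
As a sanity check on the CreateVolume request above, `required_bytes` and the `rbd: create ... size 21504M` line agree if the provisioner's M means MiB (2^20 bytes), which is my assumption here:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

required_bytes = 22548578304          # from the CreateVolume request
size_mib = required_bytes // MIB      # whole MiB in the request
remainder = required_bytes % MIB      # bytes left over, if any

print(size_mib, remainder, required_bytes / GIB)
```

The request is exactly 21 GiB with no remainder, so the size was passed through to Ceph without rounding.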

      PVC Describe

      # oc describe pvc plan-user1-vm-53836-gn822 -n vmexamples-user1
        Name:          plan-user1-vm-53836-gn822
        Namespace:     vmexamples-user1
        StorageClass:  ocs-external-storagecluster-ceph-rbd
        Status:        Bound
        Volume:        pvc-c7f4a338-5e5e-4ed7-97d8-8c55a79a22e1
        Labels:        app=containerized-data-importer
                       app.kubernetes.io/component=storage
                       app.kubernetes.io/managed-by=cdi-controller
                       app.kubernetes.io/part-of=hyperconverged-cluster
                       app.kubernetes.io/version=4.16.1
                       migration=8c8570a0-10ef-457c-b18e-0bf261d1f19b
                       plan=cda3a759-5f36-42f7-b3d6-98aa624f02b8
                       vmID=vm-53836
        Annotations:   cdi.kubevirt.io/createdForDataVolume: f616c33c-7069-40c1-bcc2-3fa09105505e
                       cdi.kubevirt.io/storage.condition.running: false
                       cdi.kubevirt.io/storage.condition.running.message:
                       cdi.kubevirt.io/storage.condition.running.reason: Completed
                       cdi.kubevirt.io/storage.contentType: kubevirt
                       cdi.kubevirt.io/storage.deleteAfterCompletion: false
                       cdi.kubevirt.io/storage.pod.phase: Succeeded
                       cdi.kubevirt.io/storage.pod.restarts: 0
                       cdi.kubevirt.io/storage.populator.progress: 100.0%
                       cdi.kubevirt.io/storage.preallocation.requested: false
                       cdi.kubevirt.io/storage.usePopulator: true
                       forklift.konveyor.io/disk-source: [workload_share_QHFNI] database_24/database_2.vmdk
                       migration: 8c8570a0-10ef-457c-b18e-0bf261d1f19b
                       plan: cda3a759-5f36-42f7-b3d6-98aa624f02b8
                       pv.kubernetes.io/bind-completed: yes
                       pv.kubernetes.io/bound-by-controller: yes
                       vmID: vm-53836
                       volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
                       volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
        Finalizers:    [kubernetes.io/pvc-protection]
        Capacity:      5Gi
        Access Modes:  RWX
        VolumeMode:    Block
        DataSource:
          APIGroup:  cdi.kubevirt.io
          Kind:      VolumeImportSource
          Name:      volume-import-source-f616c33c-7069-40c1-bcc2-3fa09105505e
        Used By:     <none>
        Events:      <none>

         

      PV Describe

        (migration_test_env) [root@utility load_test_mtv]# oc describe pv pvc-c7f4a338-5e5e-4ed7-97d8-8c55a79a22e1
        Name:            pvc-c7f4a338-5e5e-4ed7-97d8-8c55a79a22e1
        Labels:          <none>
        Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
                         volume.kubernetes.io/provisioner-deletion-secret-name: rook-csi-rbd-provisioner
                         volume.kubernetes.io/provisioner-deletion-secret-namespace: openshift-storage
        Finalizers:      [external-provisioner.volume.kubernetes.io/finalizer kubernetes.io/pv-protection]
        StorageClass:    ocs-external-storagecluster-ceph-rbd
        Status:          Bound
        Claim:           vmexamples-user1/plan-user1-vm-53836-gn822
        Reclaim Policy:  Delete
        Access Modes:    RWX
        VolumeMode:      Block
        Capacity:        5Gi
        Node Affinity:   <none>
        Message:
        Source:
            Type:              CSI (a Container Storage Interface (CSI) volume source)
            Driver:            openshift-storage.rbd.csi.ceph.com
            FSType:
            VolumeHandle:      0001-0011-openshift-storage-0000000000000017-98f7cacf-5626-48df-9fb0-8f402021c98b
            ReadOnly:          false
            VolumeAttributes:      clusterID=openshift-storage
                                   imageFeatures=layering,deep-flatten,exclusive-lock,object-map,fast-diff
                                   imageFormat=2
                                   imageName=ocp4-cluster-vcch9-98f7cacf-5626-48df-9fb0-8f402021c98b
                                   journalPool=ocpv01-tenants
                                   pool=ocpv01-tenants
                                   storage.kubernetes.io/csiProvisionerIdentity=1723643714412-1686-openshift-storage.rbd.csi.ceph.com
                                   volumeNamePrefix=ocp4-cluster-vcch9-
        Events:                <none>
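
The VolumeHandle looks like ceph-csi's composite ID: a 4-hex-digit version, a 4-hex-digit cluster ID length, the cluster ID, a 16-hex-digit pool ID, and the image UUID. The sketch below decodes it on that assumption (the layout is as I understand it and has not been verified against this ceph-csi build; `decode_volume_handle` is a hypothetical helper):

```python
def decode_volume_handle(handle: str) -> dict:
    """Split a ceph-csi style volume handle into its assumed fields."""
    version = int(handle[0:4], 16)
    id_len = int(handle[5:9], 16)            # length of the cluster ID, in characters
    cluster_id = handle[10:10 + id_len]
    rest = handle[10 + id_len:]              # "-<16 hex pool id>-<image uuid>"
    if handle[4] != "-" or handle[9] != "-" or rest[0] != "-" or rest[17] != "-":
        raise ValueError("unexpected volume handle layout")
    return {
        "version": version,
        "cluster_id": cluster_id,
        "pool_id": int(rest[1:17], 16),
        "image_uuid": rest[18:],
    }


# VolumeHandle, volumeNamePrefix, and imageName as shown in the PV describe above.
handle = "0001-0011-openshift-storage-0000000000000017-98f7cacf-5626-48df-9fb0-8f402021c98b"
fields = decode_volume_handle(handle)
image_name = "ocp4-cluster-vcch9-" + fields["image_uuid"]

print(fields)
print(image_name)
```

Under that reading, the handle's cluster ID and image UUID line up with the clusterID and imageName attributes on the PV, so the PV metadata is at least internally consistent.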

            eperezma Eugenio Pérez
            jmaltin Judd Maltin