  OpenShift Virtualization / CNV-51521

Incomplete transfer of change blocks leads to data corruption (vddk datasource)


    • CNV v4.99.0.rhel9-1748, CNV v4.15.8.rhel9-52
    • Storage Core Sprint 263

      Description of problem:

      The issue is in the vddk datasource.
      
      The cause appears to be that QueryChangedDiskAreas returns at most 2000 changed blocks per query. Details below.
      On a 16 GiB disk, I created a snapshot. I then used dd to write 1 KiB of data at 2 MiB intervals:
      
      for i in $(seq 2048 2048 20480000); do dd if=/dev/urandom bs=1k seek=$i of=/dev/sdb count=1;done
      
      Given this, QueryChangedDiskAreas should report changes at offsets 2097152, 4194304, 6291456, 8388608, ..., 17177772032.
      I followed https://www.veeam.com/kb2092 to call QueryChangedDiskAreas manually from the VMware MOB. With a start offset of 0, it returned at most 2000 changed blocks in one response, so the result stopped at offset 4194304000 and we lose the rest. Querying again with start offset 4194304000 returns the next 2000 changed blocks, and so on.
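The numbers above can be checked with a little arithmetic. The following is a minimal sketch, assuming each dd write shows up as one changed area starting at the written offset and that the per-query cap is exactly 2000 as observed; the constant names are illustrative and not taken from CDI or the vSphere API.

```python
import math

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

DISK_SIZE = 16 * GIB         # disk size from the report
STRIDE = 2 * MIB             # dd seeks in steps of 2048 blocks of 1 KiB
MAX_AREAS_PER_QUERY = 2000   # cap observed through the MOB (assumption: exact)

# Start offsets the dd loop can write to before it runs past the end of the disk.
expected_offsets = list(range(STRIDE, DISK_SIZE, STRIDE))

# Last changed area that fits into a single capped response starting at offset 0.
last_in_first_query = expected_offsets[MAX_AREAS_PER_QUERY - 1]

# Changed areas that a single query from offset 0 never reports.
missed = len(expected_offsets) - MAX_AREAS_PER_QUERY

# Number of capped queries needed to see every changed area.
queries_needed = math.ceil(len(expected_offsets) / MAX_AREAS_PER_QUERY)

print(len(expected_offsets), last_in_first_query, missed, queries_needed)
# prints: 8191 4194304000 6191 5
```

Under these assumptions, a single query from offset 0 reports fewer than a quarter of the changed areas on this disk.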
      IIUC, CDI only queries from offset 0, so if more than 2000 block areas have changed, the changes in the remaining areas are lost. See https://github.com/kubevirt/containerized-data-importer/blob/448fe9d566964f0fbff4fa2f4f0f4904e5331838/pkg/importer/vddk-datasource_amd64.go#L861
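One way to avoid the loss is to keep querying from the end of the range each response covers until the whole disk has been walked. The sketch below illustrates that loop against a simulated server; it is not CDI's code. `make_fake_query`, `ChangeInfo` and `collect_all` are hypothetical names, and the simulation assumes that a truncated response covers the range up to the end of its last returned area, which may differ from what vSphere actually reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Area = Tuple[int, int]  # (start offset, length) in bytes


@dataclass
class ChangeInfo:
    """Simplified response: the range covered and the changed areas inside it."""
    start_offset: int
    length: int
    areas: List[Area]


def make_fake_query(changed: List[Area], disk_size: int, cap: int = 2000) -> Callable[[int], ChangeInfo]:
    """Build a hypothetical stand-in for QueryChangedDiskAreas with a per-response cap."""
    def query(offset: int) -> ChangeInfo:
        pending = [a for a in changed if a[0] >= offset]
        page = pending[:cap]
        if len(pending) > cap:
            # Truncated: the response only covers up to the end of the last area returned.
            end = page[-1][0] + page[-1][1]
        else:
            end = disk_size
        return ChangeInfo(offset, end - offset, page)
    return query


def collect_all(query: Callable[[int], ChangeInfo], disk_size: int) -> List[Area]:
    """Page through the disk, restarting each query where the previous response ended."""
    areas: List[Area] = []
    offset = 0
    while offset < disk_size:
        info = query(offset)
        if info.length <= 0:
            raise RuntimeError("query made no progress at offset %d" % offset)
        areas.extend(info.areas)
        offset = info.start_offset + info.length
    return areas


# Reproduce the layout from the report: 1 KiB changed every 2 MiB on a 16 GiB disk.
DISK = 16 * 1024**3
CHANGED = [(off, 1024) for off in range(2 * 1024**2, DISK, 2 * 1024**2)]
fake_query = make_fake_query(CHANGED, DISK)

single = fake_query(0).areas            # what a single query from offset 0 sees
paged = collect_all(fake_query, DISK)   # what the paging loop sees
print(len(single), len(paged))
```

The guard against a non-advancing response is there so that an unexpected empty range fails loudly instead of looping forever.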
      I ran the same test with fio while MTV was doing the backup, and queried QueryChangedDiskAreas manually through the VMware MOB. Whenever there is corruption, more than 2000 block areas have changed. When I compare a dump of sdb from the VM against the downloaded image using cmp, the offsets that differ are the ones that are not reported by QueryChangedDiskAreas with offset 0.
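The cmp check can be automated roughly as follows. This is a sketch only: it compares two images chunk by chunk and lists the differing chunks that no reported changed area overlaps. The function names and the chunk size are illustrative choices, and the demo uses tiny in-memory buffers instead of real disk dumps.

```python
import io
from typing import BinaryIO, Iterator, List, Tuple

Area = Tuple[int, int]  # (start offset, length) in bytes


def differing_chunks(a: BinaryIO, b: BinaryIO, chunk_size: int) -> Iterator[int]:
    """Yield the start offset of every chunk whose bytes differ between a and b."""
    offset = 0
    while True:
        chunk_a = a.read(chunk_size)
        chunk_b = b.read(chunk_size)
        if not chunk_a and not chunk_b:
            return
        if chunk_a != chunk_b:
            yield offset
        offset += chunk_size


def unreported_diffs(a: BinaryIO, b: BinaryIO, areas: List[Area], chunk_size: int) -> List[int]:
    """Return differing chunk offsets that overlap none of the reported areas."""
    result = []
    for start in differing_chunks(a, b, chunk_size):
        end = start + chunk_size
        if not any(s < end and s + length > start for s, length in areas):
            result.append(start)
    return result


# Tiny demo: 8 chunks of 4 bytes; chunks 1 and 5 differ, only chunk 1 was reported.
source = io.BytesIO(b"AAAA" * 8)
copy = io.BytesIO(b"AAAA" + b"BBBB" + b"AAAA" * 3 + b"CCCC" + b"AAAA" * 2)
reported = [(4, 4)]
demo_result = unreported_diffs(source, copy, reported, chunk_size=4)
print(demo_result)  # prints: [20]
```

For real images, the two arguments would be files opened in binary mode (for example the sdb dump and the downloaded image), and `areas` would be the list obtained from the query with offset 0.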
      

      Version-Release number of selected component (if applicable):

      4.15.7
      

      How reproducible:

      Always, when more than 2000 blocks have changed.
      

      Steps to Reproduce:

      1. Import from VMware
      2. Change more than 2000 blocks in a snapshot (with high load)
      3. Observe that the pulled snapshot is incomplete (i.e., leading to disk corruption as described in MTV-1679)
      

      Actual results:

      Corrupted disk image
      

      Expected results:

      Consistent disk image
      

      Additional info:

      
      

              akalenyu Alex Kalenyuk
              rhn-support-nashok Nijin Ashok
              Kevin Alon Goldblatt Kevin Alon Goldblatt