Migration Toolkit for Virtualization · MTV-3241

Investigate and Optimize Disk Integrity Verification for Large-Scale VM Migrations


    • Type: Feature Request
    • Resolution: Won't Do
    • Priority: Undefined
    • Future Sustainability

      An analysis was conducted on a 100GB test file to compare the performance of several disk integrity verification methods. The following experiments were run in a containerized environment, testing with 1, 2, and 4 CPU cores:

      • Full Scan (blksum): A multi-threaded, optimized C application using the blake3 hash.
      • Full Scan (Parallel CRC32): A custom Python implementation using the multiprocessing library.
      • Statistical Sampling: A probabilistic method reading ~3.6% of the disk to detect >=1MB errors with 99.99% confidence.
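      The ~3.6% figure for the sampling method follows from the standard geometric bound on missing an error region with independent random reads. A minimal sketch of that calculation (the 4 KiB block size and function name are illustrative assumptions, not taken from the actual experiment):

      ```python
      import math

      def sample_fraction(disk_size, min_error, block_size, confidence):
          """Fraction of the disk to read so that a corrupt region of at
          least `min_error` bytes is hit with probability `confidence`,
          assuming uniformly random, independent block samples."""
          # Probability that one random block lands inside the error region.
          p_hit = min_error / disk_size
          # Require (1 - p_hit)^n <= 1 - confidence, i.e. the chance that
          # all n samples miss the region is below the allowed miss rate.
          n = math.ceil(math.log(1 - confidence) / math.log(1 - p_hit))
          return n * block_size / disk_size

      frac = sample_fraction(
          disk_size=100 * 2**30,  # 100 GiB test file
          min_error=1 * 2**20,    # detect corruption spanning >= 1 MiB
          block_size=4096,        # 4 KiB sample reads (assumed)
          confidence=0.9999,      # 99.99% detection confidence
      )
      print(f"{frac:.1%}")  # → 3.6%
      ```

      Note that the required fraction grows as the minimum error size shrinks, which is why sampling cannot cheaply detect small corruptions.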

       

      Key Findings:

      • I/O Bound Performance: The primary bottleneck for full-scan methods is the disk's read speed. Both blksum and the parallel Python CRC32 script performed similarly, indicating they are both efficient enough to be limited by disk I/O rather than CPU processing.
      • Optimal CPU Scaling: Performance peaked at 2 CPU cores. Adding more cores (e.g., 4) did not decrease the runtime and, in some cases, slightly increased it due to overhead. This confirms the process is I/O-bound, as 2 cores are sufficient to process data as fast as the disk can supply it.
      • Statistical Sampling Inefficiency: The random sampling method was significantly slower than a full sequential scan. This is due to the performance penalty of random I/O (disk seek time) outweighing the benefit of reading less data.
      • Confidence Levels:
        • Full scan methods (blksum, CRC32) provide 100% confidence for detecting accidental corruption.
        • Statistical sampling provides 99.99% confidence but only for errors of a specified minimum size (e.g., 1MB).

      Conclusion: For ensuring complete data integrity, a parallel, full-scan checksum is the most reliable method. The blksum tool is a robust, pre-built solution that scales well up to the I/O limit. Our custom parallel CRC32 script successfully replicated this performance, confirming the disk as the bottleneck.
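      A minimal sketch of how such a parallel full-scan CRC32 could be structured with multiprocessing (the 64 MiB work unit, the per-chunk split, and the composite final CRC are illustrative assumptions, not the actual script):

      ```python
      import os
      import zlib
      from multiprocessing import Pool

      CHUNK = 64 * 2**20  # 64 MiB per work unit (assumed)

      def crc_of_chunk(args):
          path, offset, length = args
          # Each worker opens the file independently and reads its slice
          # sequentially, so the disk sees large contiguous reads.
          crc = 0
          with open(path, "rb", buffering=0) as f:
              f.seek(offset)
              remaining = length
              while remaining:
                  data = f.read(min(remaining, 2**20))
                  if not data:
                      break
                  crc = zlib.crc32(data, crc)
                  remaining -= len(data)
          return crc

      def parallel_crc32(path, workers=2):
          size = os.path.getsize(path)
          tasks = [(path, off, min(CHUNK, size - off))
                   for off in range(0, size, CHUNK)]
          with Pool(workers) as pool:
              chunk_crcs = pool.map(crc_of_chunk, tasks)
          # Fold the per-chunk CRCs into one composite value; any corrupted
          # chunk changes its own CRC and therefore the final result.
          final = 0
          for c in chunk_crcs:
              final = zlib.crc32(c.to_bytes(4, "big"), final)
          return final
      ```

      With `workers=2` this mirrors the observed scaling peak: two readers saturate the disk, so additional processes only add overhead.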


      Verifying the integrity of large disks (100GB+) during VM migrations using a full checksum is a time-consuming process that can significantly extend maintenance windows. We need to identify the fastest and most efficient method to ensure data integrity with a high degree of confidence.

       

      Full results:
      https://docs.google.com/document/d/1alCNC5wRhZrWVNR_6fAS1Spe6yU0206Io3YElFNPMIU/edit?tab=t.0#heading=h.rpo38b11pwlo 

              rh-ee-aweinsto Amit Weinstock
              rgolan1@redhat.com Roy Golan
              Votes: 0
              Watchers: 3