Project: AI Platform Core Components
Issue: AIPCC-10255

Overlapping compilation time with benchmarking for the autotuner

    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined
    • Component: PyTorch
    • Story Points: 8
    • Sprint: PyTorch Sprint 26

      Compilation time is often the most time-consuming component of auto-tuning. It is also highly skewed – there are a few configs with very large, outlier compilation times. Due to batched evaluation of configs, we must wait for all of the configs to complete compiling before beginning to benchmark. The presence of outliers makes this especially inefficient – often we are waiting on just a handful of configs to finish.

      <img width="4800" height="1800" alt="Image" src="https://github.com/user-attachments/assets/856770f7-4b0c-4fcc-80c7-7011c3c90fb2" />

      To address this, one approach is to *overlap compilation with benchmarking*: we could begin benchmarking already-compiled configs instead of waiting for the outlier configs to finish. However, a key concern is that this could bias the benchmarking results for CPU-bound kernels. For now, we should probably expose this as an experimental feature that is disabled by default (i.e. introduce a HELION_AUTOTUNE_OVERLAP_COMPILATION flag).
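      A minimal sketch of the idea, assuming hypothetical `compile_config` / `benchmark` stand-ins and a thread pool (none of these names come from the Helion codebase; the real autotuner's compilation and benchmarking interfaces would differ):

      ```python
      import os
      import time
      from concurrent.futures import ThreadPoolExecutor, as_completed

      def compile_config(cfg):
          # Stand-in for kernel compilation; outlier configs take much longer.
          time.sleep(cfg["compile_s"])
          return cfg["name"]

      def benchmark(kernel):
          # Stand-in for benchmarking a compiled kernel.
          return f"{kernel}: ok"

      def autotune(configs, overlap=False):
          results = []
          with ThreadPoolExecutor(max_workers=4) as pool:
              futures = [pool.submit(compile_config, c) for c in configs]
              if overlap:
                  # Overlapped: benchmark each config as soon as its
                  # compilation finishes, instead of blocking on the
                  # slowest (outlier) compile.
                  for fut in as_completed(futures):
                      results.append(benchmark(fut.result()))
              else:
                  # Batched: wait for every compile before benchmarking.
                  kernels = [f.result() for f in futures]
                  results = [benchmark(k) for k in kernels]
          return results

      # Gate the experimental behavior behind the proposed env flag.
      overlap = os.environ.get("HELION_AUTOTUNE_OVERLAP_COMPILATION", "0") == "1"
      ```

      In the overlapped mode, benchmarking runs concurrently with the remaining compile jobs, which is exactly the condition that could perturb timings for CPU-bound kernels.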

      To verify the effect of this, we should run benchmarks on kernels with small shapes. Let's aim for very small (16x16, larger if necessary) matmul, layernorm, rmsnorm, softmax, and cross-entropy kernels.

      @hinriksnaer mentioned that he is interested in this.

              Assignee: rh-ee-hgudmund Hinrik Gudmundsson
              Reporter: rh-ee-hgudmund Hinrik Gudmundsson
              Team: PyTorch Compile