-
Story
-
Resolution: Unresolved
-
PyTorch Sprint 26
Compilation is often the most time-consuming part of auto-tuning, and its cost is highly skewed: a few configs have very large, outlier compilation times. Because configs are evaluated in batches, we must wait for every config in a batch to finish compiling before benchmarking can begin. Outliers make this especially inefficient, since we often end up waiting on just a handful of configs.
<img width="4800" height="1800" alt="Image" src="https://github.com/user-attachments/assets/856770f7-4b0c-4fcc-80c7-7011c3c90fb2" />
One way to address this is to *overlap compilation with benchmarking*, so that benchmarking can start while the outlier configs are still compiling. A key concern is that concurrent compilation could bias the benchmark results for CPU-bound kernels. For now, we should probably expose this as an experimental feature that is off by default (i.e. introduce a `HELION_AUTOTUNE_OVERLAP_COMPILATION` flag).
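To make the idea concrete, here is a minimal sketch of the difference between batched and overlapped evaluation. It is not Helion's autotuner: `fake_compile`, `fake_benchmark`, and `autotune` are hypothetical stand-ins, compilation is simulated with `time.sleep`, and the `"1"` convention for reading the flag is an assumption. Only the flag name comes from the proposal above.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed, wait


def overlap_enabled() -> bool:
    # Flag name is from the proposal; treating "1" as "on" is an assumption.
    return os.environ.get("HELION_AUTOTUNE_OVERLAP_COMPILATION", "0") == "1"


def fake_compile(config: str, seconds: float) -> str:
    # Hypothetical stand-in for compiling one config.
    time.sleep(seconds)
    return config


def fake_benchmark(compiled: str) -> str:
    # Hypothetical stand-in for benchmarking one compiled config.
    return compiled


def autotune(compile_times: dict[str, float], overlap: bool) -> list[str]:
    """Return the order in which configs were benchmarked."""
    order = []
    with ThreadPoolExecutor(max_workers=len(compile_times)) as pool:
        futures = [
            pool.submit(fake_compile, cfg, secs)
            for cfg, secs in compile_times.items()
        ]
        if overlap:
            # Benchmark each config as soon as its compilation finishes,
            # while slower configs are still compiling in the background.
            for fut in as_completed(futures):
                order.append(fake_benchmark(fut.result()))
        else:
            # Batched behavior: block until every config has compiled.
            wait(futures)
            for fut in futures:
                order.append(fake_benchmark(fut.result()))
    return order
```

In the overlapped path the outlier config is benchmarked last and no longer blocks the fast configs. The bias concern also shows up here: benchmarks in the overlapped path run while other compilations are still using CPU.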
To verify the effect, we should run benchmarks on kernels with small shapes, since these are the most likely to be CPU-bound. Let's aim for very small (16x16, larger if necessary) matmul, layernorm, rmsnorm, softmax, and cross-entropy kernels.
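One possible way to quantify the bias, sketched under assumptions: time each config once in isolation and once with compilation overlapped, then compare the two sets of measurements. The helpers `relative_bias` and `ranking_preserved` are hypothetical names, not part of Helion, and the choice of median latency as the summary statistic is an assumption.

```python
from statistics import median


def relative_bias(isolated_ms: list[float], overlapped_ms: list[float]) -> float:
    """Fractional change in median latency when compilation overlaps benchmarking."""
    base = median(isolated_ms)
    return (median(overlapped_ms) - base) / base


def ranking_preserved(isolated: dict[str, float], overlapped: dict[str, float]) -> bool:
    """True if configs sort into the same fastest-to-slowest order in both runs."""
    return sorted(isolated, key=isolated.get) == sorted(overlapped, key=overlapped.get)
```

For auto-tuning, the ranking check is arguably the more important of the two: a uniform slowdown across configs would still select the same winner, whereas a reordering would change which config is chosen.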
@hinriksnaer mentioned that he is interested in this.
- clones
-
AIPCC-9986 [Helion] Enable early termination of bad configs during optimization
-
- Review
-