- Story
- Resolution: Unresolved
- Normal
- None
- None
- 8
- False
- False
- AIPCC Accelerators 23, AIPCC Accelerators 24, AIPCC Accelerators 25, AIPCC Accelerators 26
- 🐛 Describe the bug
With flex attention, you're responsible for creating a block mask to pass into `flex_attention`. That mask can then be reused across subsequent flex attention calls. If you're trying to avoid recompilation of transformer blocks, the naive approach of creating and caching the mask inside the block results in at least two compiles: one where you compile and cache the block mask, and another where you retrieve it. You can manually hoist the block mask computation out of the transformer block (a sketch of that workaround follows), but this is somewhat annoying.
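A minimal sketch of the manual workaround, assuming a simple causal mask, illustrative shapes, and an available CUDA device (none of these names or sizes come from the issue): the block mask is built once outside the compiled region and passed into the compiled block as an argument.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask


def causal_mask(b, h, q_idx, kv_idx):
    # Standard causal mask_mod: a query position may only attend to earlier keys.
    return q_idx >= kv_idx


@torch.compile
def attention_block(q, k, v, block_mask):
    # Reusing the same precomputed block_mask avoids recompiling this block
    # just because the mask was created (and cached) inside it.
    return flex_attention(q, k, v, block_mask=block_mask)


B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda")
k = torch.randn(B, H, S, D, device="cuda")
v = torch.randn(B, H, S, D, device="cuda")

# Manually hoisted out of the transformer block: computed once, reused every call.
block_mask = create_block_mask(causal_mask, B, H, S, S, device="cuda")

out = attention_block(q, k, v, block_mask)
```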
What would be nice is a way to ask Dynamo to hoist the block mask creation out, in the same way Python RNG calls are hoisted out. The idea is that you have a function that takes only constant arguments, and Dynamo treats it as a black box. The user asserts that it is safe for this function to be reordered with respect to the rest of the compute. Dynamo then emits prelude bytecode that calls this function and feeds its result into the Dynamo region (a hypothetical sketch of the user-facing API follows below).
This is different from `assume_constant_result`, since the result is not constant; the function must be rerun on every Dynamo invocation.
We could also choose not to do this and require people to manually hoist their code.
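For illustration only, here is roughly what the proposed user-facing API could look like. The decorator name below is hypothetical (no such API exists in PyTorch) and is stubbed as a no-op so the snippet is valid Python; the actual feature would have Dynamo emit prelude bytecode for the marked call rather than tracing it.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

SEQ_LEN = 1024


def hoist_out_of_graph(fn):
    # Hypothetical marker (not a real PyTorch API): the user asserts it is safe
    # to reorder this call with respect to the rest of the compute. Stubbed as
    # an identity decorator here so the sketch remains runnable today.
    return fn


@hoist_out_of_graph
def make_block_mask():
    # Takes only constant arguments. Unlike assume_constant_result, the result
    # is not baked into the graph as a constant: it would be recomputed on every
    # Dynamo invocation, via prelude bytecode run before the compiled region.
    return create_block_mask(
        lambda b, h, q_idx, kv_idx: q_idx >= kv_idx,
        None, None, SEQ_LEN, SEQ_LEN, device="cuda",
    )


@torch.compile
def attention_block(q, k, v):
    # With the proposed feature, Dynamo would hoist this call into the prelude
    # and feed its result into the compiled region instead of tracing it.
    block_mask = make_block_mask()
    return flex_attention(q, k, v, block_mask=block_mask)
```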
cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @anijain2305 @ailzhang @drisspg
- Versions
main
- clones
AIPCC-8113 Skipping combo kernel for large pointwise nodes
- In Progress