Unify dispatch region formation + Conversion to Flow passes #8662
RE insert/extract slice folding: what are the conditions for that happening? One thing we want to make sure of is that we aren't introducing false dependencies - for dense slices it's better for example to produce a slice and then insert it as a separate op as otherwise if a dispatch needs to capture the whole target as an I/O all dispatches touching that target will be serialized. For scattered extracts/slices there's not much we can do, but if there's any chance of being able to take a subspan of a resource we should always do that. |
If we want to create the slice in one op and insert it in another, then that's what happens before this change, i.e. we don't fuse the |
|
👍 Padding is great to fuse as it's non-contiguous and there's not much anyone else could do with it concurrently anyway. It's more subspan extracts/inserts that we want to ensure we don't fuse from slice/concat ops. (FYI) Today we can exploit the concurrency when we have that pattern but don't yet place the allocations correctly resulting in an unneeded serialized copy per update. #7729 is tracking the work - basically, we really want to see those slice + dispatch + update sequences out of flow instead of just dispatches on the full resources (%44/%23/%57). Once that issue is fixed fusing any insertion that could otherwise be updates would always be a pessimization as both are zero-copy but the fusion approach blocks concurrency. It also impacts multi-device as instead of being able to move/stage the smaller subranges of the buffers across devices we'd have to move the entire resource when anything changed. |
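To make the false-dependency concern above concrete, here is a minimal sketch in Python. It is not IREE's scheduler or any real API: the `Access` type, the `serialized_pairs` helper, and the 40-byte `target` resource are all hypothetical names chosen for illustration. It only assumes that two writes must be ordered when they touch overlapping byte ranges of the same resource.

```python
from itertools import combinations
from typing import NamedTuple


class Access(NamedTuple):
    """Hypothetical record: one dispatch writing bytes [start, end) of a resource."""
    dispatch: str
    resource: str
    start: int
    end: int


def conflicts(a: Access, b: Access) -> bool:
    # Two writes need ordering only if they hit the same resource
    # and their half-open byte ranges overlap.
    return a.resource == b.resource and a.start < b.end and b.start < a.end


def serialized_pairs(accesses):
    # Each conflicting pair is an ordering edge that blocks concurrency.
    return [(a.dispatch, b.dispatch)
            for a, b in combinations(accesses, 2) if conflicts(a, b)]


RESOURCE_SIZE = 40  # assumed size in bytes, e.g. a 10xf32 target

# Fused form: every dispatch captures the whole target as an I/O.
fused = [Access(f"d{i}", "target", 0, RESOURCE_SIZE) for i in range(3)]

# Slice + dispatch + update form: each update touches only its own subrange.
sliced = [Access(f"d{i}", "target", i * 8, i * 8 + 8) for i in range(3)]

print(serialized_pairs(fused))   # every pair of dispatches is ordered
print(serialized_pairs(sliced))  # disjoint subranges leave no ordering edges
```

Under these assumptions the fused form orders all three dispatches against each other, while the subrange form leaves them free to run concurrently, which is the behavior the comment above wants to preserve.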
Sounds good. I might just drop this. To handle padding we should probably not split the padding early on, but split it during dispatch region formation. Padding needs a fill + insert, and the fill needs to happen separately. Will leave a note for @antiagainst to pick that up when he starts the pad fusion with producer work. |
|
I'm a fan of the cleanup here and if it lets us work around extra padding work then that's cool - just wanted to make sure we were preserving the simple slice/updates :) |
Oh yeah, for sure. Going to land this. I meant I'll drop the part that tries to fuse the |
0ee9817 to b236bdd
Abbreviated Benchmark Summary @ commit 3e73d2c3bac4cc30fdfbb63cd8b30d268a344228 (no previous benchmark results to compare against since 5f6f2f64fb7e846b9057804e0c40387217c6d708) Raw Benchmarks
[Top 3 out of 189 benchmark results showed] For more information: |
b236bdd to ae9057a
|
Fun. Regressions here were because the |
| "@llvm-project//mlir:FuncDialect", | ||
| "@llvm-project//mlir:IR", | ||
| "@llvm-project//mlir:InferTypeOpIntefaces", | ||
| "@llvm-project//mlir:InferTypeOpInteface", |
Inteface -> Interface? I am curious if the typo must be maintained to work properly.
Oh, that's why CI is failing. Thanks!
okkwon left a comment
Looks good to me. Found some small things.
| flow.dispatch.tensor.store %7, %1, offsets = [%arg0], sizes = [2], strides = [1] : tensor<2xf32> -> !flow.dispatch.tensor<writeonly:10xf32> | ||
| } | ||
| return | ||
| } |
Thanks for catching this!
There is a weird dependence between the ConvertToFlowBefore/AfterDispatchRegionFormationPass and the dispatch region formation itself. It represents a weird separation of what can't be moved into dispatches (due to bufferization issues), what needs to be converted to flow operations, and what cannot be converted to flow operations but still needs to be in dispatches. The only way to resolve these dependencies is to unify these passes. It also cleans up how dispatch region formation works, allowing the formation to either work on a root + fused ops or move individual ops into the dispatch. Since the conversion to flow happens within dispatch region formation, the following passes are unnecessary:
- `ConvertToFlowAfterDispatchFormation`
- `ConvertToFlowBeforeDispatchFormation`

A new pass, `ConvertToFlow`, is added just for testing the conversion patterns.
Also add some tests to verify the correct lowering for concat operations and for reshapes not being fused due to bufferization issue (which is the current state anyway).
ae9057a to bb1c636
|
Have a very interesting case of performance regression here (link). Most of the above is noise. The only ones worth looking into are the MobileBert Fp32 numbers. For one, they only regress on Vulkan with Adreno. They improve with Mali and on CPU. So I'd consider this not a blocker. The difference though is interesting. Looking at the IR before outlining dispatch regions with and without the changes, I see the only difference is in this chain of ops
Before this PR, the order was
After this PR the order is
In theory both should be exactly identical in terms of performance. The only difference is the K, which does not feature in the elementwise operation, so it shouldn't matter which GEMM you fuse this with. The modified code should actually be better because it does a better job of fusing slices with consumer dispatches, which reduces some of that overhead. Wondering if it has something to do with the concurrency. I am going to check one more case, the MobileSSD one that doesn't seem to be noise, but I am not considering this case a blocker because I can't really control this in the fusion; it's just using SSA values, and both orders really should be equivalent. |
|
The MobileSSD seems like noise. Before and after the change the IR before outlining dispatches is identical. |