MaheshRavishankar · 2022-03-28T23:36:24Z

There is a weird dependence between the
ConvertToFlowBefore/AfterDispatchRegionFormationPass and the dispatch
region formation itself. It represents a weird separation of what
can't be moved into dispatches (due to bufferization issues), what needs
to be converted to flow operations, and what cannot be converted to
flow operations but still needs to be in dispatches.

The only way to resolve these dependencies is to unify these passes.
This also cleans up how dispatch region formation works, allowing the
formation to either work on root + fused ops or move individual ops
into the dispatch.

Since the conversion to flow happens within dispatch region formation,
the following passes are unnecessary:

- ConvertToFlowAfterDispatchFormation
- ConvertToFlowBeforeDispatchFormation

A new pass, ConvertToFlow, is added just for testing the conversion
patterns.

benvanik · 2022-03-28T23:45:42Z

RE insert/extract slice folding: what are the conditions for that happening? One thing we want to make sure of is that we aren't introducing false dependencies - for dense slices it's better for example to produce a slice and then insert it as a separate op as otherwise if a dispatch needs to capture the whole target as an I/O all dispatches touching that target will be serialized. For scattered extracts/slices there's not much we can do, but if there's any chance of being able to take a subspan of a resource we should always do that.

MaheshRavishankar · 2022-03-28T23:52:51Z

> RE insert/extract slice folding: what are the conditions for that happening? One thing we want to make sure of is that we aren't introducing false dependencies - for dense slices it's better for example to produce a slice and then insert it as a separate op as otherwise if a dispatch needs to capture the whole target as an I/O all dispatches touching that target will be serialized. For scattered extracts/slices there's not much we can do, but if there's any chance of being able to take a subspan of a resource we should always do that.

If we want to create the slice in one op and insert it in another, then that's what happens before this change, i.e. we don't fuse the insert_slice with the op that produces the slice. For "fusion with producer of padding" we need to fuse the insert_slice created with its producer, so that the slice created gets written into the right place of the "fused" operation. The latter is not yet prototyped. I can hide it behind a flag for later evaluation and tightening of the controls there. Thanks for the info.

benvanik · 2022-03-29T00:03:53Z

👍 Padding is great to fuse as it's non-contiguous and there's not much anyone else could do with it concurrently anyway. It's more subspan extracts/inserts that we want to ensure we don't fuse from slice/concat ops.

(FYI) Today we can exploit the concurrency when we have that pattern but don't yet place the allocations correctly resulting in an unneeded serialized copy per update. #7729 is tracking the work - basically, we really want to see those slice + dispatch + update sequences out of flow instead of just dispatches on the full resources (%44/%23/%57). Once that issue is fixed fusing any insertion that could otherwise be updates would always be a pessimization as both are zero-copy but the fusion approach blocks concurrency. It also impacts multi-device as instead of being able to move/stage the smaller subranges of the buffers across devices we'd have to move the entire resource when anything changed.
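The false-dependency concern above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Access`, `conflicts`, and `serialized_pairs` are hypothetical names invented for illustration, not IREE APIs, and writes are modeled as half-open element ranges on a single 10-element resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Half-open element range [start, end) that one dispatch writes."""
    start: int
    end: int


def conflicts(a: Access, b: Access) -> bool:
    # Two writes must be ordered only if their ranges overlap.
    return a.start < b.end and b.start < a.end


def serialized_pairs(accesses) -> int:
    """Count pairs of dispatches that cannot run concurrently."""
    n = len(accesses)
    return sum(
        conflicts(accesses[i], accesses[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


RESOURCE_SIZE = 10  # hypothetical target size, chosen for illustration

# Fused form: every dispatch captures the whole target as an I/O.
fused = [Access(0, RESOURCE_SIZE) for _ in range(5)]

# Slice + dispatch + update form: each dispatch touches only its subspan.
subspans = [Access(i, i + 2) for i in range(0, RESOURCE_SIZE, 2)]

print(serialized_pairs(fused))     # all 10 pairs conflict
print(serialized_pairs(subspans))  # no pair conflicts
```

In this model the five whole-resource dispatches form a fully ordered chain, while the five subspan dispatches are free to run concurrently.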

MaheshRavishankar · 2022-03-29T02:41:11Z

> +1 Padding is great to fuse as it's non-contiguous and there's not much anyone else could do with it concurrently anyway. It's more subspan extracts/inserts that we want to ensure we don't fuse from slice/concat ops.
>
> (FYI) Today we can exploit the concurrency when we have that pattern but don't yet place the allocations correctly resulting in an unneeded serialized copy per update. #7729 is tracking the work - basically, we really want to see those slice + dispatch + update sequences out of flow instead of just dispatches on the full resources (%44/%23/%57). Once that issue is fixed fusing any insertion that could otherwise be updates would always be a pessimization as both are zero-copy but the fusion approach blocks concurrency. It also impacts multi-device as instead of being able to move/stage the smaller subranges of the buffers across devices we'd have to move the entire resource when anything changed.

Sounds good. I might just drop this. To handle padding we should probably not split the padding early on, but split it during dispatch region formation. Padding needs a fill + insert, and the fill needs to happen separately. Will leave a note for @antiagainst to pick that up when he starts the pad fusion with producer work.
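The "padding needs a fill + insert" decomposition mentioned above can be sketched on plain nested lists. This is a minimal illustration, not the compiler's lowering: `fill`, `insert_slice`, and `pad` are hypothetical helpers that only mimic the semantics of the corresponding tensor ops for 2-D data.

```python
def fill(shape, value):
    """Create a rows x cols tensor where every element is `value`."""
    rows, cols = shape
    return [[value] * cols for _ in range(rows)]


def insert_slice(source, dest, offsets):
    """Return a copy of `dest` with `source` written at `offsets`."""
    r0, c0 = offsets
    result = [row[:] for row in dest]
    for i, row in enumerate(source):
        for j, v in enumerate(row):
            result[r0 + i][c0 + j] = v
    return result


def pad(source, low, high, value):
    """Padding expressed as a fill of the full shape plus an insert."""
    rows = low[0] + len(source) + high[0]
    cols = low[1] + len(source[0]) + high[1]
    filled = fill((rows, cols), value)         # step 1: fill
    return insert_slice(source, filled, low)   # step 2: insert


padded = pad([[1, 2], [3, 4]], low=(1, 1), high=(0, 1), value=0)
for row in padded:
    print(row)
```

The fill touches the whole padded result while the insert only touches the interior, which is why the two steps are natural candidates for being handled separately.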

benvanik · 2022-03-29T02:53:56Z

I'm a fan of the cleanup here and if it lets us work around extra padding work then that's cool - just wanted to make sure we were preserving the simple slice/updates :)

MaheshRavishankar · 2022-03-29T03:22:39Z

> I'm a fan of the cleanup here and if it lets us work around extra padding work then that's cool - just wanted to make sure we were preserving the simple slice/updates :)

Oh yeah, for sure. Going to land this. I meant I'll drop the part that tries to fuse the tensor.insert_slice with its producers. That's a small part of this.

iree-github-actions-bot · 2022-03-29T07:50:30Z

Abbreviated Benchmark Summary

@ commit 3e73d2c3bac4cc30fdfbb63cd8b30d268a344228 (no previous benchmark results to compare against since 5f6f2f64fb7e846b9057804e0c40387217c6d708)

Raw Benchmarks

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV2 [fp32,imagenet] (TFLite) full-inference,default-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 12 | 13 | 1 |
| MobileNetV2 [fp32,imagenet] (TFLite) kernel-execution,experimental-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 7 | 7 | 0 |
| MobileNetV2 [fp32,imagenet] (TFLite) full-inference,experimental-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 9 | 10 | 1 |

[Top 3 out of 189 benchmark results showed]

MaheshRavishankar · 2022-03-29T21:16:48Z

Fun. Regressions here were because the linalg.fill wasn't converted to flow.tensor.splat. Hopefully this fixes all regressions.

okkwon · 2022-03-29T22:28:28Z

        "@llvm-project//mlir:FuncDialect",
        "@llvm-project//mlir:IR",
-        "@llvm-project//mlir:InferTypeOpIntefaces",
+        "@llvm-project//mlir:InferTypeOpInteface",


Inteface -> Interface? I am curious if the typo must be maintained to work properly.

Oh, that's why CI is failing. Thanks!

okkwon

Looks good to me. Found some small things.

okkwon · 2022-03-29T23:10:27Z

+    flow.dispatch.tensor.store %7, %1, offsets = [%arg0], sizes = [2], strides = [1] : tensor<2xf32> -> !flow.dispatch.tensor<writeonly:10xf32>
+  }
+  return
+}


Thanks for catching this!

MaheshRavishankar · 2022-03-30T01:58:03Z

There is a very interesting case of performance regression here (link). Most of the above is noise. The only ones worth looking into are the MobileBert fp32 numbers.

For one, they only regress on Vulkan with Adreno. They improve with Mali and on CPU, so I'd consider this not a blocker. The difference is interesting, though. Looking at the IR before outlining dispatch regions, with and without the changes, I see the only difference is in this chain of ops:

g1 = gemm1(a, b) // M = 384, N = 128, K = 512
g2 = gemm2(c, d) // M = 384, N = 128, K = 128
result = elementwise_ops(g1, g2, ...)

Before this PR, the order was

g2 = gemm2(c, d)
result = fused_gemm_elementwise(a, b, g2, ...)

After this PR the order is

g1 = gemm1(a, b)
result = fused_gemm_elementwise(c, d, g1, ...)

In theory both should be exactly identical in terms of performance. The only difference is K, which does not feature in the elementwise operation, so it shouldn't matter which GEMM you fuse this with. The modified code should actually be better because it does a better job of fusing slices with consumer dispatches, which reduces some of that overhead. Wondering if it has something to do with the concurrency.

I am going to check one more case, MobileSSD, that doesn't seem to be noise, but I am not considering this case a blocker because I can't really control this in the fusion. It's just using SSA values, and really both should be equivalent.
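The equivalence argued above can be checked on a small example. This is a sketch under assumptions: `gemm` and `fused_gemm_elementwise` are hypothetical pure-Python stand-ins for the dispatches, the elementwise op is an arbitrary illustrative choice, and the shapes are tiny integers rather than the 384x128 case. It only demonstrates that both fusion orders compute the same values, not anything about performance.

```python
def gemm(lhs, rhs):
    """Plain matrix multiply: (M x K) times (K x N) gives (M x N)."""
    m, k, n = len(lhs), len(rhs), len(rhs[0])
    return [
        [sum(lhs[i][p] * rhs[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def fused_gemm_elementwise(lhs, rhs, other, op):
    """Compute gemm(lhs, rhs) and apply `op` against `other` per element."""
    m, k, n = len(lhs), len(rhs), len(rhs[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(lhs[i][p] * rhs[p][j] for p in range(k))
            # The elementwise part only iterates over M x N; K is gone.
            out[i][j] = op(acc, other[i][j])
    return out


def elementwise(g1, g2):
    # Illustrative, deliberately non-commutative elementwise op.
    return g1 - 2 * g2


a = [[1, 2, 3], [4, 5, 6]]    # M=2, K=3
b = [[1, 0], [0, 1], [2, 2]]  # K=3, N=2
c = [[1, 1], [2, 3]]          # M=2, K=2
d = [[3, 1], [0, 2]]          # K=2, N=2

# Order before the PR: gemm2 runs alone, gemm1 is fused with the elementwise op.
g2 = gemm(c, d)
before = fused_gemm_elementwise(a, b, g2, elementwise)

# Order after the PR: gemm1 runs alone, gemm2 is fused with the elementwise op.
g1 = gemm(a, b)
after = fused_gemm_elementwise(c, d, g1, lambda x2, x1: elementwise(x1, x2))

print(before == after)
```

Both orders produce the same result tensor; they differ only in which GEMM's reduction loop sits inside the fused dispatch.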

FYI @benvanik @antiagainst

MaheshRavishankar · 2022-03-30T02:33:28Z

The MobileSSD seems like noise. Before and after the change the IR before outlining dispatches is identical.

MaheshRavishankar requested review from antiagainst and benvanik March 28, 2022 23:36

MaheshRavishankar requested a review from hanhanW as a code owner March 28, 2022 23:36

MaheshRavishankar added the buildkite:benchmark label Mar 28, 2022

MaheshRavishankar requested a review from okkwon March 28, 2022 23:37

MaheshRavishankar force-pushed the dispatch_region_formation branch from 0ee9817 to b236bdd March 29, 2022 06:28

MaheshRavishankar force-pushed the dispatch_region_formation branch from b236bdd to ae9057a March 29, 2022 21:16

benvanik approved these changes Mar 29, 2022

okkwon reviewed Mar 29, 2022

okkwon approved these changes Mar 29, 2022

Mahesh Ravishankar added 3 commits March 29, 2022 17:27

Make unfused linalg.fill op be converted to flow.tensor.splat.

230e494

Also add some tests to verify the correct lowering for concat operations and for reshapes not being fused due to bufferization issue (which is the current state anyway).

Fix bazel build.

bb1c636

MaheshRavishankar force-pushed the dispatch_region_formation branch from ae9057a to bb1c636 March 30, 2022 00:35

Address comments.

3e73d2c

MaheshRavishankar merged commit 45cb3f8 into iree-org:main Mar 30, 2022

MaheshRavishankar deleted the dispatch_region_formation branch March 30, 2022 05:19
