Unify dispatch region formation + Conversion to Flow passes#8662

Merged
MaheshRavishankar merged 4 commits into
iree-org:mainfrom
MaheshRavishankar:dispatch_region_formation
Mar 30, 2022

@MaheshRavishankar
Collaborator

@MaheshRavishankar MaheshRavishankar commented Mar 28, 2022

There is a weird dependence between the
ConvertToFlowBefore/AfterDispatchRegionFormationPass and the dispatch
region formation itself. It represents a weird separation of what
can't be moved into dispatches (due to bufferization issues), what needs
to be converted to flow operations, and what cannot be converted to
flow operations but still needs to be in dispatches.

The only way to resolve these dependencies is to unify these passes.
It also cleans up how dispatch region formation works, allowing the
formation to either work on a root + fused ops or move individual ops
into the dispatch.

Since the conversion to flow happens within dispatch region formation, the
following passes are unnecessary:

  • ConvertToFlowAfterDispatchFormation
  • ConvertToFlowBeforeDispatchFormation

A new pass, ConvertToFlow, is added just for testing the conversion
patterns.

@benvanik
Collaborator

RE insert/extract slice folding: what are the conditions for that happening? One thing we want to make sure of is that we aren't introducing false dependencies - for dense slices it's better for example to produce a slice and then insert it as a separate op as otherwise if a dispatch needs to capture the whole target as an I/O all dispatches touching that target will be serialized. For scattered extracts/slices there's not much we can do, but if there's any chance of being able to take a subspan of a resource we should always do that.

@MaheshRavishankar
Collaborator Author

> RE insert/extract slice folding: what are the conditions for that happening? One thing we want to make sure of is that we aren't introducing false dependencies - for dense slices it's better for example to produce a slice and then insert it as a separate op as otherwise if a dispatch needs to capture the whole target as an I/O all dispatches touching that target will be serialized. For scattered extracts/slices there's not much we can do, but if there's any chance of being able to take a subspan of a resource we should always do that.

If we want to create the slice in one op and insert it in another, then that's what happens before this change, i.e. we don't fuse the insert_slice with the op that produces the slice. For "fusion with producer of padding" we need to fuse the created insert_slice with its producer, so that the slice gets written into the right place of the "fused" operation. The latter is not yet prototyped. I can hide it behind a flag for later evaluation and tightening of the controls there. Thanks for the info.

@benvanik
Collaborator

👍 Padding is great to fuse as it's non-contiguous and there's not much anyone else could do with it concurrently anyway. It's more subspan extracts/inserts that we want to ensure we don't fuse from slice/concat ops.

(FYI) Today we can exploit the concurrency when we have that pattern but don't yet place the allocations correctly resulting in an unneeded serialized copy per update. #7729 is tracking the work - basically, we really want to see those slice + dispatch + update sequences out of flow instead of just dispatches on the full resources (%44/%23/%57). Once that issue is fixed fusing any insertion that could otherwise be updates would always be a pessimization as both are zero-copy but the fusion approach blocks concurrency. It also impacts multi-device as instead of being able to move/stage the smaller subranges of the buffers across devices we'd have to move the entire resource when anything changed.
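
The serialization concern above can be sketched with a toy model. This is only an illustration, not IREE's scheduler: it assumes each dispatch declares the element range of the resource it writes, and that two dispatches may overlap in time only if those ranges are disjoint. The names `Access`, `conflicts`, and `can_run_concurrently` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """A write to `length` elements of `resource` starting at `offset` (toy model)."""
    resource: str
    offset: int
    length: int


def conflicts(a: Access, b: Access) -> bool:
    """Two accesses conflict if they touch overlapping ranges of the same resource."""
    if a.resource != b.resource:
        return False
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length


def can_run_concurrently(accesses: list[Access]) -> bool:
    """True if no pair of accesses conflicts, so nothing forces serialization."""
    return not any(
        conflicts(a, b)
        for i, a in enumerate(accesses)
        for b in accesses[i + 1:]
    )


# Fused insert: every dispatch captures the whole 1024-element target as I/O.
fused = [Access("target", 0, 1024), Access("target", 0, 1024)]

# Separate slice + update: each dispatch only touches its own subspan.
subspans = [Access("target", 0, 512), Access("target", 512, 512)]

print(can_run_concurrently(fused))     # whole-target captures overlap
print(can_run_concurrently(subspans))  # disjoint subspans do not
```

Under this model the fused form reports a conflict while the subspan form does not, which is the false dependency being described.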

@MaheshRavishankar
Collaborator Author

> +1 Padding is great to fuse as it's non-contiguous and there's not much anyone else could do with it concurrently anyway. It's more subspan extracts/inserts that we want to ensure we don't fuse from slice/concat ops.
>
> (FYI) Today we can exploit the concurrency when we have that pattern but don't yet place the allocations correctly resulting in an unneeded serialized copy per update. #7729 is tracking the work - basically, we really want to see those slice + dispatch + update sequences out of flow instead of just dispatches on the full resources (%44/%23/%57). Once that issue is fixed fusing any insertion that could otherwise be updates would always be a pessimization as both are zero-copy but the fusion approach blocks concurrency. It also impacts multi-device as instead of being able to move/stage the smaller subranges of the buffers across devices we'd have to move the entire resource when anything changed.

Sounds good. I might just drop this. To handle padding we should probably not split the padding early on, but split it during dispatch region formation. Padding needs a fill + insert, and the fill needs to happen separately. Will leave a note for @antiagainst to pick that up when he starts the pad fusion with producer work.
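
As a rough illustration of "padding needs a fill + insert", here is a minimal sketch on nested Python lists. It is not the compiler's lowering; `fill`, `insert_slice`, and `pad` are hypothetical helpers that only mirror the decomposition described above.

```python
def fill(shape, value):
    """Create a rows x cols tensor where every element is `value`."""
    rows, cols = shape
    return [[value] * cols for _ in range(rows)]


def insert_slice(source, dest, offsets):
    """Return a copy of `dest` with `source` written at the given (row, col) offsets."""
    row0, col0 = offsets
    result = [row[:] for row in dest]
    for i, row in enumerate(source):
        for j, element in enumerate(row):
            result[row0 + i][col0 + j] = element
    return result


def pad(source, low, high, value):
    """Pad a 2-D tensor as a fill of the full result followed by an insert of the source."""
    rows = len(source) + low[0] + high[0]
    cols = len(source[0]) + low[1] + high[1]
    return insert_slice(source, fill((rows, cols), value), low)


padded = pad([[1, 2], [3, 4]], low=(1, 1), high=(0, 1), value=0)
for row in padded:
    print(row)
```

The fill touches the whole result while the insert only touches the interior, which is why the two steps are treated separately in the discussion.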

@benvanik
Collaborator

I'm a fan of the cleanup here and if it lets us work around extra padding work then that's cool - just wanted to make sure we were preserving the simple slice/updates :)

@MaheshRavishankar
Collaborator Author

> I'm a fan of the cleanup here and if it lets us work around extra padding work then that's cool - just wanted to make sure we were preserving the simple slice/updates :)

Oh yeah, for sure. Going to land this. I meant I'll drop the part that tries to fuse the tensor.insert_slice with its producers. That's a small part of this.

@MaheshRavishankar MaheshRavishankar force-pushed the dispatch_region_formation branch from 0ee9817 to b236bdd Compare March 29, 2022 06:28
@iree-github-actions-bot
Contributor

iree-github-actions-bot commented Mar 29, 2022

@MaheshRavishankar MaheshRavishankar force-pushed the dispatch_region_formation branch from b236bdd to ae9057a Compare March 29, 2022 21:16
@MaheshRavishankar
Collaborator Author

Fun. Regressions here were because the linalg.fill wasn't converted to flow.tensor.splat. Hopefully this fixes all regressions.

"@llvm-project//mlir:FuncDialect",
"@llvm-project//mlir:IR",
"@llvm-project//mlir:InferTypeOpIntefaces",
"@llvm-project//mlir:InferTypeOpInteface",
Member

Inteface -> Interface? I am curious if the typo must be maintained to work properly.

Collaborator Author

Oh, that's why CI is failing. Thanks!

Member

@okkwon okkwon left a comment

Looks good to me. Found some small things.

Comment thread iree/compiler/Codegen/Common/DestructiveUpdateUtils.cpp Outdated
Comment thread llvm-external-projects/iree-dialects/BUILD Outdated
flow.dispatch.tensor.store %7, %1, offsets = [%arg0], sizes = [2], strides = [1] : tensor<2xf32> -> !flow.dispatch.tensor<writeonly:10xf32>
}
return
}
Member

No CHECKs?

Collaborator Author

Thanks for catching this!

Comment thread iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp Outdated
Mahesh Ravishankar added 3 commits March 29, 2022 17:27
There is a weird dependence between the
ConvertToFlowBefore/AfterDispatchRegionFormationPass and the dispatch
region formation itself. It represents a weird separation of what
can't be moved into dispatches (due to bufferization issues), what needs
to be converted to flow operations, and what cannot be converted to
flow operations but still needs to be in dispatches.

The only way to resolve these dependencies is to unify these passes.
It also cleans up how dispatch region formation works, allowing the
formation to either work on a root + fused ops or move individual ops
into the dispatch.

Since the conversion to flow happens within dispatch region formation, the
following passes are unnecessary:
- `ConvertToFlowAfterDispatchFormation`
- `ConvertToFlowBeforeDispatchFormation`

A new pass `ConvertToFlow` is added just for testing the conversion
patterns.
Also add some tests to verify the correct lowering for concat
operations and for reshapes not being fused due to bufferization issue
(which is the current state anyway).
@MaheshRavishankar MaheshRavishankar force-pushed the dispatch_region_formation branch from ae9057a to bb1c636 Compare March 30, 2022 00:35
@MaheshRavishankar
Collaborator Author

MaheshRavishankar commented Mar 30, 2022

Have a very interesting case for performance regression here (link). Most of the above are noise. The only ones worth looking into are the MobileBert Fp32 numbers.

For one, they only regress on Vulkan with Adreno; they improve on Mali and on CPU. So I'd consider this not a blocker. The difference is interesting, though. Looking at the IR before outlining dispatch regions with and without the changes, I see the only difference is in this chain of ops:

g1 = gemm1(a, b) // M = 384, N = 128, K = 512
g2 = gemm2(c, d) // M = 384, N = 128, K = 128
result = elementwise_ops(g1, g2, ...)

Before this PR, the order was

g2 = gemm2(c, d)
result = fused_gemm_elementwise(a, b, g2, ...)

After this PR the order is

g1 = gemm1(a, b)
result = fused_gemm_elementwise(c, d, g1, ...)

In theory both should be exactly identical in terms of performance. The only difference is K, which does not feature in the elementwise operation, so it shouldn't matter which GEMM you fuse this with. The modified code should actually be better because it does a better job of fusing slices with consumer dispatches, which reduces some of that overhead. Wondering if it has something to do with the concurrency...
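
To make the equivalence concrete, here is a small sketch of the two fusion orders. It assumes the elementwise op is a plain addition of the two GEMM results (the original only says `elementwise_ops(g1, g2, ...)`), uses tiny integer matrices rather than the real 384x128 shapes, and the helpers `matmul` and `fused_gemm_elementwise` are illustrative stand-ins, not IREE functions.

```python
def matmul(lhs, rhs):
    """Plain GEMM on nested lists: (M x K) @ (K x N) -> (M x N)."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*rhs)]
        for row in lhs
    ]


def fused_gemm_elementwise(lhs, rhs, other):
    """Compute lhs @ rhs and add `other` to each output element in the same loop nest."""
    inner = len(rhs)
    result = []
    for i, row in enumerate(lhs):
        out_row = []
        for j in range(len(rhs[0])):
            acc = sum(row[k] * rhs[k][j] for k in range(inner))
            out_row.append(acc + other[i][j])  # assumed elementwise op: addition
        result.append(out_row)
    return result


# gemm1 operands: M = 2, N = 2, K = 3. gemm2 operands: M = 2, N = 2, K = 2.
a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [1, 1]]
c = [[2, 1], [0, 3]]
d = [[1, 2], [3, 4]]

# Order before the PR: gemm2 runs standalone, gemm1 is fused with the elementwise op.
before = fused_gemm_elementwise(a, b, matmul(c, d))

# Order after the PR: gemm1 runs standalone, gemm2 is fused with the elementwise op.
after = fused_gemm_elementwise(c, d, matmul(a, b))

print(before == after)
```

Both orders yield the same values here, and K only changes the length of the inner reduction, not the shape of the elementwise work. With floating-point data and a non-symmetric elementwise op the details could differ, so this only illustrates the argument rather than proving it.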

I am going to check one more case, MobileSSD, that doesn't seem to be noise, but I am not considering this case a blocker because I can't really control this in the fusion. It's just using SSA values, and really both orders should be equivalent.

FYI @benvanik @antiagainst

@MaheshRavishankar
Collaborator Author

The MobileSSD case seems like noise. Before and after the change, the IR before outlining dispatches is identical.

@MaheshRavishankar MaheshRavishankar merged commit 45cb3f8 into iree-org:main Mar 30, 2022
@MaheshRavishankar MaheshRavishankar deleted the dispatch_region_formation branch March 30, 2022 05:19
4 participants