[CUDA] Add pipeline to target tensorcore ops #7655
Conversation
609452c to 5c2a787 Compare
Depends on https://reviews.llvm.org/D113618
entryPoint.walk([&fusedOpSupported](linalg::GenericOp linalgOp) {
  for (Operation &fusedOp : linalgOp.getOps()) {
    if (!isa<arith::AddFOp, arith::MulFOp, MaxFOp, MinFOp, linalg::YieldOp,
             linalg::GenericOp, arith::DivFOp>(fusedOp)) {
We cannot have GenericOp inside GenericOp anyway right? Isn't that verified by the op?
oops left over from a hack I had, thanks for catching this.
// tile size.
for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
  if (sizeN % config.tileSize[1] == 0 &&
      sizeM % config.tileSize[0] == 0) {
No need to check K here? Is that handled somewhere else?
Good point, added it.
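For concreteness, a minimal sketch of the check with K included, assuming tileSize is ordered {M, N, K} and that sizeK holds the reduction dimension (the helper name and struct layout here are illustrative assumptions, not the PR's exact code):

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct TileWorkgroupSizePair {
  std::array<int64_t, 3> tileSize;  // assumed order: {M, N, K}
  std::array<int64_t, 3> workgroupSize;
};

// Hypothetical helper: picks the first config whose tiles evenly divide all
// three GEMM dimensions, now including the K check raised in the review.
static const TileWorkgroupSizePair *pickAlignedConfig(
    const std::vector<TileWorkgroupSizePair> &configs, int64_t sizeM,
    int64_t sizeN, int64_t sizeK) {
  for (const TileWorkgroupSizePair &config : configs) {
    if (sizeM % config.tileSize[0] == 0 && sizeN % config.tileSize[1] == 0 &&
        sizeK % config.tileSize[2] == 0)
      return &config;
  }
  return nullptr;  // no aligned config; caller falls back to a default
}
```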
static constexpr int32_t kNumGPUDims = 3;

static constexpr int32_t kWarpSize = 32;
I just noticed that these declarations are not put in a namespace. Could you put them in the proper namespace to avoid leaking to the global namespace?
Those are marked static, so they don't pollute anything outside the files including this header, similar to the static functions.
Adding the mlir and iree namespaces for the whole file.
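A minimal sketch of that wrap (the exact namespace names are an assumption based on the comment above):

```cpp
#include <cstdint>

namespace mlir {
namespace iree_compiler {

// `static` already gives these internal linkage per including TU; the
// namespaces additionally keep the names out of global-namespace lookup.
static constexpr int32_t kNumGPUDims = 3;
static constexpr int32_t kWarpSize = 32;

}  // namespace iree_compiler
}  // namespace mlir
```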
    llvm::cl::init(false));

static llvm::cl::opt<std::string> clTargetChip(
    "iree-cuda-llvm-target-arch", llvm::cl::desc("LLVM target chip"),
Nit: the arch number is not LLVM-specific, I think? Why not just call it iree-cuda-target-arch?
Well, it is meant to match LLVM's -march option. I believe we will want more options matching the LLVM target machine in the future, and I think it makes sense to keep it in sync with LLVM.
iree/compiler/Codegen/Passes.h (Outdated)
/// Convert Linalg ops to Vector.
std::unique_ptr<OperationPass<FuncOp>> createLLVMGPUVectorizationPass();

/// Convert Linalg ops to Vector and prepare conversion to gpu mma ops.
Nit: GPU MMA
  return procInfo;
}

/// Compute subgroup ID. CUDA doesn't have a subgroupId equivalent so we are
The impl should be in a .cpp file?
So this has a strong assumption that each warp is responsible for (X, Y, Z) = (<warp-size>, <full-Y>, <full-Z>) threads. It would be great to specify that explicitly in the comment to make it clear for the curious. :)
> The impl should be in a .cpp file?

True, I currently don't have a .cpp file. I'd rather send an NFC change following this PR, as this PR is already big and I would rather not include extra refactoring changes. Is that okay with you?
> So this has a strong assumption that each warp is responsible for (X, Y, Z) = (<warp-size>, <full-Y>, <full-Z>) threads. It would be great to specify that explicitly in the comment to make it clear for the curious. :)

Good point, added a comment.
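For the curious, a plain C++ model of what the snippet suggests (an illustration, not the PR's MLIR builder code): the X thread ID is divided by the warp size while Y and Z pass through unchanged, which identifies the warp correctly whenever the workgroup's X extent is a multiple of the warp size.

```cpp
#include <array>
#include <cstdint>

constexpr int32_t kWarpSize = 32;

// CUDA has no subgroupId intrinsic, so the per-dimension warp ID is derived
// from the thread IDs: warps are carved out along X only, so each warp spans
// kWarpSize consecutive X threads. Valid only when dimX % kWarpSize == 0.
std::array<int32_t, 3> subgroupIds(int32_t tidX, int32_t tidY, int32_t tidZ) {
  return {tidX / kWarpSize, tidY, tidZ};  // only the X dimension is divided
}
```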
      Identifier::get(getWorkgroupKTiledMarker(), context)));
}

/// Patterns for thread level tiling.
This is warp level right? :)
Changed.
/// Patterns for thread level tiling.
static void populateTilingToWarpPatterns(
    MLIRContext *context, OwningRewritePatternList &patterns,
Do you need to pass in context? It can be queried from patterns.
Bad copy-paste indeed.
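For reference, a sketch of the fixed signature with the context queried from the pattern list, as suggested (the real parameter list is abbreviated; this assumes the era's OwningRewritePatternList alias, which exposes getContext()):

```cpp
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

/// Patterns for warp level tiling (sketch; other parameters omitted).
static void populateTilingToWarpPatterns(OwningRewritePatternList &patterns) {
  // No separate MLIRContext parameter: the pattern list already carries it.
  MLIRContext *context = patterns.getContext();
  (void)context;  // tiling patterns would be constructed with `context` here
}
```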
  mlir::Value subgroupId =
      builder.create<mlir::gpu::ThreadIdOp>(loc, indexType, attr);
  if (i == 0) {
    mlir::AffineExpr d0 = getAffineDimExpr(0, builder.getContext());
Super nit: builder.getAffineDimExpr(0)?
Done
MaheshRavishankar left a comment:
Mostly looks fine to me. A few nits. Please ping me when it is actually ready to land.
  tileY = sizeM;
  workgroupSize = {sizeM, sizeN, 1};
  bool foundTensorCoreConfig = false;
  if (supportsTensorCore(entryPoint, op)) {
Just in terms of readability, move this to another function?
To add, one way to make things more readable is to think of each piece following the logic:

if (!is_lowering_config_set) {
  ...
  set_lowering_config
}

Then it's a matter of ordering these different pieces. That makes the code more readable, IMO.
Changed the logic to a sequence of:

if (!isLoweringConfigSet) {
  if (match) {
    ...
    isLoweringConfigSet = true;
  }
}
if (!isLoweringConfigSet) {
  ...
I don't see the point of isLoweringConfigSet here either. Basically

if (getLoweringConfig(op)) {
  ...
}

is the same as isLoweringConfigSet. There is too much chained conditioning here, which is really hard to read and really hard to modify. In many ways the configuration attributes exist precisely to avoid this kind of code pattern. The steps are:
- Check if the configuration is already set. If so, do nothing.
- Set the configuration in the IR. Whether you have set the configuration or not is then specified in the IR instead of in a side-car data structure.
This is basically the rule of thumb of materializing state in the IR instead of in C++ code. You don't need the chain of isTensorCoreSupported -> getTensorCoreConfig -> setTensorCoreConfig. If tensor core is supported, you set the config in the IR directly. There is no need to add the TileWorkgroupSizePair data structure here. You just call setOpConfigAndEntryPointFnTranslation and it will put all the information in the right place.
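To make the suggested structure concrete, a small self-contained model of the "materialize state in IR" rule (the helpers mirror names from this thread but are stand-ins, not IREE's actual signatures):

```cpp
#include <optional>

// Stand-in for an op that can carry a lowering-config attribute in the IR.
struct Op {
  std::optional<int> loweringConfig;  // models the IR attribute
};

// Mirrors getLoweringConfig: the IR itself answers "is config set?".
bool hasLoweringConfig(const Op &op) { return op.loweringConfig.has_value(); }

// Mirrors setOpConfigAndEntryPointFnTranslation: writes config into the IR.
void setLoweringConfig(Op &op, int config) { op.loweringConfig = config; }

void setRootConfig(Op &op, bool supportsTensorCore) {
  // 1. Configuration already materialized in the IR: nothing to do. No
  //    side-car `isLoweringConfigSet` bool is needed.
  if (hasLoweringConfig(op)) return;
  // 2. Otherwise pick and record the config directly in the IR; later
  //    queries read the same state back.
  setLoweringConfig(op, supportsTensorCore ? /*tensor core*/ 1
                                           : /*default*/ 0);
}
```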
/// operations.
static void getTensorCoreConfig(
    SmallVectorImpl<TileWorkgroupSizePair> &tileSizes) {
  tileSizes.push_back(TileWorkgroupSizePair({{64, 64, 16}, {64, 2, 1}}));
I think tileSizes.emplace_back({{..}, {..}}) also works?
true
Actually, I can't get the compiler to take this; somehow it is not able to infer the right type for the std::array when trying to do that.
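That failure is a language quirk rather than anything IREE-specific: emplace_back forwards its arguments through a deduced template parameter, and a braced init list has no type, so it cannot cross that deduction boundary. A minimal sketch, assuming TileWorkgroupSizePair is an aggregate of two std::array members:

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct TileWorkgroupSizePair {
  std::array<int64_t, 3> tileSize;
  std::array<int64_t, 3> workgroupSize;
};

int main() {
  std::vector<TileWorkgroupSizePair> tileSizes;
  // OK: the braced list is bound to the named aggregate type first.
  tileSizes.push_back(TileWorkgroupSizePair({{64, 64, 16}, {64, 2, 1}}));
  // Does not compile: `{{64, 64, 16}, {64, 2, 1}}` has no type, so
  // emplace_back's forwarding parameter cannot be deduced from it.
  // tileSizes.emplace_back({{64, 64, 16}, {64, 2, 1}});
}
```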
getTensorCoreConfig(TCtileSizeConfig);
// Pick the best configuration where the original shape is aligned on the
// tile size.
for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
Nit: this is why I dislike side data structures that just carry information to populate another data structure. It adds cognitive overhead to reading the code.
The goal is to separate the configuration data from the logic. I don't really have a better solution at the moment. Do you have any suggestions?
I left a comment above.
    : public OpRewritePattern<vector::BroadcastOp> {
  using OpRewritePattern<vector::BroadcastOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(vector::BroadcastOp op,
Why is this in IREE?
We have a bit of an ordering issue upstream. After my changes to vectorization, we lower transfer op permutation maps early, but some other transformations, like vector to MMA, need the broadcast/transpose to be merged into the transfer op to work. This is something I still need to figure out; since I haven't yet settled on the ideal solution, I have this in IREE for now.
pm.addPass(createCSEPass());

// Distribute linalg onto warps within the workgroup.
pm.addNestedPass<FuncOp>(
Maybe this pass needs to be split into three.
- The current tile + distribute without lowering to loops
- The warp distribution
- The lowering to loops
I do think it should be split. It should be one part reduction tiling + promotion, then tile and distribute to warps or threads.
Not sure what you mean by lowering to loops; we don't really have any lowering to loops in this path?
I was considering doing this change here, but the PR is already big.
I was using "lowering to loops" as a catch-all for lowering to scalar + vector code, basically the last step before the NVVM translation.
Fine to do it as a next step.
This is currently enabled only if the target architecture is set to sm_80. This adds a flag to specify the target architecture for CUDA target.
6f1c9c7 to ac7a066 Compare
MaheshRavishankar left a comment:
Functionally this is fine. I would really recommend some additional cleanup PRs before things get more involved. I left a few comments in place; the biggest thing to clean up is to avoid having duplicate data structures. The goal is that you materialize everything in the IR when you need to, and then just query the IR. Having separate channels just makes things more complicated.
ac7a066 to 2461b99 Compare
97860fb to 2732b89 Compare