[CUDA] Add pipeline to target tensorcore ops #7655
Conversation
609452c to 5c2a787 Compare
Depends on https://reviews.llvm.org/D113618
entryPoint.walk([&fusedOpSupported](linalg::GenericOp linalgOp) {
  for (Operation &fusedOp : linalgOp.getOps()) {
    if (!isa<arith::AddFOp, arith::MulFOp, MaxFOp, MinFOp, linalg::YieldOp,
             linalg::GenericOp, arith::DivFOp>(fusedOp)) {
We cannot have GenericOp inside GenericOp anyway right? Isn't that verified by the op?
oops left over from a hack I had, thanks for catching this.
// tile size.
for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
  if (sizeN % config.tileSize[1] == 0 &&
      sizeM % config.tileSize[0] == 0) {
No need to check K here? Is that handled somewhere else?
Good point, added it.
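For concreteness, a minimal sketch of the check with K included, assuming tileSize is ordered {M, N, K} and that sizeK holds the reduction dimension (the helper name and struct layout here are illustrative assumptions, not the PR's exact code):

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct TileWorkgroupSizePair {
  std::array<int64_t, 3> tileSize;  // assumed order: {M, N, K}
  std::array<int64_t, 3> workgroupSize;
};

// Hypothetical helper: picks the first config whose tiles evenly divide all
// three GEMM dimensions, now including the K check raised in the review.
static const TileWorkgroupSizePair *pickAlignedConfig(
    const std::vector<TileWorkgroupSizePair> &configs, int64_t sizeM,
    int64_t sizeN, int64_t sizeK) {
  for (const TileWorkgroupSizePair &config : configs) {
    if (sizeM % config.tileSize[0] == 0 && sizeN % config.tileSize[1] == 0 &&
        sizeK % config.tileSize[2] == 0)
      return &config;
  }
  return nullptr;  // no aligned config; caller falls back to a default
}
```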
static constexpr int32_t kNumGPUDims = 3;

static constexpr int32_t kWarpSize = 32;
I just noticed that these declarations are not put in a namespace. Could you put them in the proper namespace to avoid leaking to the global namespace?
Those are marked static, so they don't pollute anything outside the files including this header, similar to the static functions.
Adding the mlir and iree namespaces for the whole file.
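A minimal sketch of that wrap (the exact namespace names are an assumption based on the comment above):

```cpp
#include <cstdint>

namespace mlir {
namespace iree_compiler {

// `static` already gives these internal linkage per including TU; the
// namespaces additionally keep the names out of global-namespace lookup.
static constexpr int32_t kNumGPUDims = 3;
static constexpr int32_t kWarpSize = 32;

}  // namespace iree_compiler
}  // namespace mlir
```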
    llvm::cl::init(false));

static llvm::cl::opt<std::string> clTargetChip(
    "iree-cuda-llvm-target-arch", llvm::cl::desc("LLVM target chip"),
Nit: the arch number is not LLVM-specific, I think? Why not just call it iree-cuda-target-arch?
Well, it is meant to match LLVM's -march option. I believe we will want more options matching the LLVM target machine in the future, and I think it makes sense to keep it in sync with LLVM.
iree/compiler/Codegen/Passes.h (Outdated)
/// Convert Linalg ops to Vector.
std::unique_ptr<OperationPass<FuncOp>> createLLVMGPUVectorizationPass();

/// Convert Linalg ops to Vector and prepare conversion to gpu mma ops.
Nit: GPU MMA
  return procInfo;
}

/// Compute subgroup ID. CUDA doesn't have a subgroupId equivalent so we are
The impl should be in a .cpp file?
So this has a strong assumption that each warp is responsible for (X, Y, Z) = (<warp-size>, <full-Y>, <full-Z>) threads. It would be great to specify that explicitly in the comment to make it clear for the curious. :)
> The impl should be in a .cpp file?

True, I currently don't have a .cpp file. I'd rather send an NFC change following this PR, as this PR is already big and I would rather not include extra refactoring changes. Is that okay with you?
> So this has a strong assumption that each warp is responsible for (X, Y, Z) = (<warp-size>, <full-Y>, <full-Z>) threads. It would be great to specify that explicitly in the comment to make it clear for the curious. :)

Good point, added a comment.
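For the curious, a plain C++ model of what the snippet suggests (an illustration, not the PR's MLIR builder code): the X thread ID is divided by the warp size while Y and Z pass through unchanged, which identifies the warp correctly whenever the workgroup's X extent is a multiple of the warp size.

```cpp
#include <array>
#include <cstdint>

constexpr int32_t kWarpSize = 32;

// CUDA has no subgroupId intrinsic, so the per-dimension warp ID is derived
// from the thread IDs: warps are carved out along X only, so each warp spans
// kWarpSize consecutive X threads. Valid only when dimX % kWarpSize == 0.
std::array<int32_t, 3> subgroupIds(int32_t tidX, int32_t tidY, int32_t tidZ) {
  return {tidX / kWarpSize, tidY, tidZ};  // only the X dimension is divided
}
```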
      Identifier::get(getWorkgroupKTiledMarker(), context)));
}

/// Patterns for thread level tiling.
This is warp level right? :)
Changed.
/// Patterns for thread level tiling.
static void populateTilingToWarpPatterns(
    MLIRContext *context, OwningRewritePatternList &patterns,
Do you need to pass in context? It can be queried from patterns.
Bad copy-paste indeed.
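For reference, a sketch of the fixed signature with the context queried from the pattern list, as suggested (the real parameter list is abbreviated; this assumes the era's OwningRewritePatternList alias, which exposes getContext()):

```cpp
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

/// Patterns for warp level tiling (sketch; other parameters omitted).
static void populateTilingToWarpPatterns(OwningRewritePatternList &patterns) {
  // No separate MLIRContext parameter: the pattern list already carries it.
  MLIRContext *context = patterns.getContext();
  (void)context;  // tiling patterns would be constructed with `context` here
}
```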
  mlir::Value subgroupId =
      builder.create<mlir::gpu::ThreadIdOp>(loc, indexType, attr);
  if (i == 0) {
    mlir::AffineExpr d0 = getAffineDimExpr(0, builder.getContext());
Super nit: builder.getAffineDimExpr(0)?
Done
MaheshRavishankar left a comment:
Mostly looks fine to me. A few nits. Please ping me when it is actually ready to land.
  tileY = sizeM;
  workgroupSize = {sizeM, sizeN, 1};
  bool foundTensorCoreConfig = false;
  if (supportsTensorCore(entryPoint, op)) {
Just in terms of readability, move this to another function?
To add, one way to make things more readable is to think of each piece following the logic:

if (!is_lowering_config_set) {
  ...
  set_lowering_config
}

Then it's a matter of ordering these different pieces. That makes the code more readable, IMO.
Changed the logic to a sequence of:

if (!isLoweringConfigSet) {
  if (match) {
    ...
    isLoweringConfigSet = true;
  }
}
if (!isLoweringConfigSet) {
  ...
I don't see the point of isLoweringConfigSet here either. Basically

if (getLoweringConfig(op)) {
  ...
}

is the same as isLoweringConfigSet. There is too much chained conditioning here, which is really hard to read and really hard to modify. In many ways the configuration attributes exist precisely to avoid this kind of code pattern. The steps are:
- Check if the configuration is already set. If so, do nothing.
- Set the configuration in the IR. Whether you have set the configuration or not is then specified in the IR instead of in a side-car data structure.
This is basically the rule of thumb of materializing state in the IR instead of in C++ code. You don't need the chain of isTensorCoreSupported -> getTensorCoreConfig -> setTensorCoreConfig. If tensor core is supported, you set the config in the IR directly. There is no need to add the TileWorkgroupSizePair data structure here. You just call setOpConfigAndEntryPointFnTranslation and it will put all the information in the right place.
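To make the suggested structure concrete, a small self-contained model of the "materialize state in IR" rule (the helpers mirror names from this thread but are stand-ins, not IREE's actual signatures):

```cpp
#include <optional>

// Stand-in for an op that can carry a lowering-config attribute in the IR.
struct Op {
  std::optional<int> loweringConfig;  // models the IR attribute
};

// Mirrors getLoweringConfig: the IR itself answers "is config set?".
bool hasLoweringConfig(const Op &op) { return op.loweringConfig.has_value(); }

// Mirrors setOpConfigAndEntryPointFnTranslation: writes config into the IR.
void setLoweringConfig(Op &op, int config) { op.loweringConfig = config; }

void setRootConfig(Op &op, bool supportsTensorCore) {
  // 1. Configuration already materialized in the IR: nothing to do. No
  //    side-car `isLoweringConfigSet` bool is needed.
  if (hasLoweringConfig(op)) return;
  // 2. Otherwise pick and record the config directly in the IR; later
  //    queries read the same state back.
  setLoweringConfig(op, supportsTensorCore ? /*tensor core*/ 1
                                           : /*default*/ 0);
}
```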
/// operations.
static void getTensorCoreConfig(
    SmallVectorImpl<TileWorkgroupSizePair> &tileSizes) {
  tileSizes.push_back(TileWorkgroupSizePair({{64, 64, 16}, {64, 2, 1}}));
I think tileSizes.emplace_back({{..}, {..}}) also works?
true
Actually, I can't get the compiler to take this; somehow it is not able to infer the right type for the std::array when trying to do that.
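That failure is a language quirk rather than anything IREE-specific: emplace_back forwards its arguments through a deduced template parameter, and a braced init list has no type, so it cannot cross that deduction boundary. A minimal sketch, assuming TileWorkgroupSizePair is an aggregate of two std::array members:

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct TileWorkgroupSizePair {
  std::array<int64_t, 3> tileSize;
  std::array<int64_t, 3> workgroupSize;
};

int main() {
  std::vector<TileWorkgroupSizePair> tileSizes;
  // OK: the braced list is bound to the named aggregate type first.
  tileSizes.push_back(TileWorkgroupSizePair({{64, 64, 16}, {64, 2, 1}}));
  // Does not compile: `{{64, 64, 16}, {64, 2, 1}}` has no type, so
  // emplace_back's forwarding parameter cannot be deduced from it.
  // tileSizes.emplace_back({{64, 64, 16}, {64, 2, 1}});
}
```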
getTensorCoreConfig(TCtileSizeConfig);
// Pick the best configuration where the original shape is aligned on the
// tile size.
for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
Nit: this is why I dislike side data structures that just carry information to populate another data structure. It adds cognitive overhead to reading the code.
The goal is to separate the configuration data from the logic. I don't really have a better solution at the moment. Do you have any suggestions?
I left a comment above.
    : public OpRewritePattern<vector::BroadcastOp> {
  using OpRewritePattern<vector::BroadcastOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(vector::BroadcastOp op,
Why is this in IREE?
We have a bit of an ordering issue upstream. After my changes to vectorization, we lower transfer op permutation maps early, but some other transformations, like vector to MMA, need the broadcast/transpose to be merged into the transfer op to work. This is something I still need to figure out; since I haven't yet settled on the ideal solution, I have this in IREE for now.
pm.addPass(createCSEPass());

// Distribute linalg onto warps within the workgroup.
pm.addNestedPass<FuncOp>(
Maybe this pass needs to be split into three.
- The current tile + distribute without lowering to loops
- The warp distribution
- The lowering to loops
I do think it should be split. It should be one part reduction tiling + promotion, then tile and distribute to warps or threads.
Not sure what you mean by lowering to loops; we don't really have any lowering to loops in this path?
I was considering doing this change here, but the PR is already big.
I was using "lowering to loops" as a catch-all for lowering to scalar + vector code, basically the last step before the NVVM translation.
Fine to do it as a next step.
This is currently enabled only if the target architecture is set to sm_80. This adds a flag to specify the target architecture for CUDA target.
6f1c9c7 to ac7a066 Compare
MaheshRavishankar left a comment:
Functionally this is fine. I would really recommend some additional cleanup PRs before things get more involved. I left a few comments in place; the biggest thing to clean up is to avoid having duplicate data structures. The goal is that you materialize everything in the IR when you need to, and then just query the IR. Having separate channels just makes things more complicated.
ac7a066 to 2461b99 Compare
97860fb to 2732b89 Compare