Conversation

@ThomasRaoux (Contributor)

This is currently enabled only if the target architecture is set to
sm_80. This adds a flag to specify the target architecture for the
CUDA target.

@ThomasRaoux (Contributor Author)

Depends on https://reviews.llvm.org/D113618

entryPoint.walk([&fusedOpSupported](linalg::GenericOp linalgOp) {
  for (Operation &fusedOp : linalgOp.getOps()) {
    if (!isa<arith::AddFOp, arith::MulFOp, MaxFOp, MinFOp, linalg::YieldOp,
             linalg::GenericOp, arith::DivFOp>(fusedOp)) {
Contributor

We cannot have GenericOp inside GenericOp anyway, right? Isn't that verified by the op?

Contributor Author

Oops, left over from a hack I had; thanks for catching this.
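
For reference, a hedged sketch of the check after dropping linalg::GenericOp from the allowed set; the handling of fusedOpSupported is an assumption based on the capture in the snippet above, not the exact PR code.

entryPoint.walk([&fusedOpSupported](linalg::GenericOp linalgOp) {
  for (Operation &fusedOp : linalgOp.getOps()) {
    // linalg::GenericOp removed from the allowed op set.
    if (!isa<arith::AddFOp, arith::MulFOp, MaxFOp, MinFOp, linalg::YieldOp,
             arith::DivFOp>(fusedOp)) {
      fusedOpSupported = false;  // assumed: flag captured by reference
      return;
    }
  }
});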

// tile size.
for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
  if (sizeN % config.tileSize[1] == 0 &&
      sizeM % config.tileSize[0] == 0) {
Contributor

No need to check K here? Is that handled somewhere else?

Contributor Author

Good point, added it.
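
For reference, a hedged sketch of the alignment check with K included; it assumes the K tile size is the third entry of tileSize and that the reduction dimension size is available as sizeK.

for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
  if (sizeM % config.tileSize[0] == 0 && sizeN % config.tileSize[1] == 0 &&
      sizeK % config.tileSize[2] == 0) {
    // Use this configuration.
  }
}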


static constexpr int32_t kNumGPUDims = 3;

static constexpr int32_t kWarpSize = 32;
Contributor

I just noticed that these declarations are not put in a namespace. Could you put them in the proper namespace to avoid leaking to the global namespace?

Contributor Author

Those are marked as static so they don't pollute anything outside of the files including this header, similar to the static functions.
Adding the mlir and iree namespaces for the whole file.
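
A minimal sketch of the suggested wrapping; the exact nested namespace names (mlir::iree_compiler) are an assumption based on the reply above.

namespace mlir {
namespace iree_compiler {

static constexpr int32_t kNumGPUDims = 3;
static constexpr int32_t kWarpSize = 32;

// ... static helper functions ...

}  // namespace iree_compiler
}  // namespace mlir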

    llvm::cl::init(false));

static llvm::cl::opt<std::string> clTargetChip(
    "iree-cuda-llvm-target-arch", llvm::cl::desc("LLVM target chip"),
Contributor

Nit: the arch number is not llvm specific I think? Why not just call it iree-cuda-target-arch?

Contributor Author

Well, it is meant to match the LLVM -march option. I believe we will want more options matching the LLVM target machine in the future, and I think it makes sense to keep it in sync with LLVM.

/// Convert Linalg ops to Vector.
std::unique_ptr<OperationPass<FuncOp>> createLLVMGPUVectorizationPass();

/// Convert Linalg ops to Vector and prepare conversion to gpu mma ops.
Contributor

Nit: GPU MMA

  return procInfo;
}

/// Compute subgroup ID. CUDA doesn't have a subgroupId equivalent so we are
Contributor

The impl should be in a .cpp file?

Contributor

So this has a strong assumption that each warp is responsible for (X, Y, Z) = (<warp-size>, <full-Y>, <full-Z>) threads. It would be great to specify that explicitly in the comment to make it clear for the curious. :)

Contributor Author (@ThomasRaoux, Nov 16, 2021)

> The impl should be in a .cpp file?

True, I currently don't have a cpp file. I'd rather send an NFC change following this PR, as this PR is already big and I would rather not include extra refactoring changes. Is that okay with you?

> So this has a strong assumption that each warp is responsible for (X, Y, Z) = (<warp-size>, <full-Y>, <full-Z>) threads. It would be great to specify that explicitly in the comment to make it clear for the curious. :)

Good point, added a comment.
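
A hedged sketch of the helper under discussion, assuming warps are laid out along X so each warp covers (warp-size, full-Y, full-Z) threads and the X subgroup index is simply threadIdx.x floor-divided by the warp size; the name and signature are illustrative, not the exact PR code.

// Illustrative only: compute the subgroup (warp) id along one dimension.
static mlir::Value getSubgroupId(mlir::OpBuilder &builder, mlir::Location loc,
                                 unsigned dim) {
  mlir::Type indexType = builder.getIndexType();
  mlir::StringAttr attr =
      builder.getStringAttr(dim == 0 ? "x" : (dim == 1 ? "y" : "z"));
  mlir::Value tid =
      builder.create<mlir::gpu::ThreadIdOp>(loc, indexType, attr);
  // Dims 1 and 2 cover the full Y/Z extent, so the thread id is the warp id.
  if (dim != 0) return tid;
  // Along X, threadIdx.x floordiv warpSize gives the warp index.
  mlir::AffineExpr d0 = builder.getAffineDimExpr(0);
  mlir::AffineMap map = mlir::AffineMap::get(
      /*dimCount=*/1, /*symbolCount=*/0, d0.floorDiv(kWarpSize));
  return builder.create<mlir::AffineApplyOp>(loc, map, mlir::ValueRange{tid});
}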

Identifier::get(getWorkgroupKTiledMarker(), context)));
}

/// Patterns for thread level tiling.
Contributor

This is warp level right? :)

Contributor Author

Changed.


/// Patterns for thread level tiling.
static void populateTilingToWarpPatterns(
    MLIRContext *context, OwningRewritePatternList &patterns,
Contributor

Do you need to pass in context? It can be queried from patterns.

Contributor Author

Bad copy-paste indeed.
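
A minimal sketch of the suggested change, querying the context from the pattern list instead of passing it in (the rest of the signature is abbreviated):

static void populateTilingToWarpPatterns(
    OwningRewritePatternList &patterns /*, ... */) {
  MLIRContext *context = patterns.getContext();
  // ... build patterns using `context` ...
}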

mlir::Value subgroupId =
    builder.create<mlir::gpu::ThreadIdOp>(loc, indexType, attr);
if (i == 0) {
  mlir::AffineExpr d0 = getAffineDimExpr(0, builder.getContext());
Contributor

Super nit: builder.getAffineDimExpr(0)?

Contributor Author

Done
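
For reference, the equivalent one-liner suggested above; both forms build the same expression:

mlir::AffineExpr d0 = builder.getAffineDimExpr(0);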

Collaborator (@MaheshRavishankar) left a comment

Mostly looks fine to me. A few nits. Please ping me when it is actually ready to land.

tileY = sizeM;
workgroupSize = {sizeM, sizeN, 1};
bool foundTensorCoreConfig = false;
if (supportsTensorCore(entryPoint, op)) {
Collaborator

Just in terms of readability, move this to another function?

Collaborator

To add, one way to make things more readable is to think of each piece following this logic:

if (is_lowering_config_set) {
   ...
   set_lowering_config
}

Then it's a matter of ordering these different pieces. That makes the code more readable IMO.

Contributor Author

Changed the logic to a sequence of:

if (!isLoweringConfigSet) {
  if (match) {
    ...
    isLoweringConfigSet = true;
  }
}
if (!isLoweringConfigSet) {

Collaborator

I don't see the point of isLoweringConfigSet here either. Basically

if (getLoweringConfig(op)) {
  ...
}

is the same as isLoweringConfigSet. There is too much chained conditioning here, which is really hard to read and really hard to modify. In many ways the configuration attributes are there precisely to avoid this kind of code pattern. The steps are:

  1. Check if the configuration is already set. If so, do nothing.
  2. Set the configuration in the IR. Whether you have set the configuration or not is specified in the IR instead of a side-car data structure.

This is basically the rule of thumb of materializing state in the IR instead of in C++ code. You don't need the chain of isTensorCoreSupported -> getTensorCoreConfig -> setTensorCoreConfig. If tensor cores are supported, you set the config in the IR directly. There is no need to add a TileWorkgroupSizePair data structure here. You just call setOpConfigAndEntryPointFnTranslation and this will put all the information in the right place.
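
A rough sketch of the control flow being described, assuming the getLoweringConfig and setOpConfigAndEntryPointFnTranslation helpers referenced above; the argument lists are placeholders, since their exact signatures are not shown in this thread.

// 1. Configuration already set in the IR: do nothing.
if (getLoweringConfig(op)) return success();

// 2. Otherwise materialize the configuration directly in the IR.
if (supportsTensorCore(entryPoint, op)) {
  return setOpConfigAndEntryPointFnTranslation(entryPoint, op, tileSizes,
                                               workgroupSize /*, ...*/);
}
// Fall through to the non-tensor-core configuration.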

/// operations.
static void getTensorCoreConfig(
    SmallVectorImpl<TileWorkgroupSizePair> &tileSizes) {
  tileSizes.push_back(TileWorkgroupSizePair({{64, 64, 16}, {64, 2, 1}}));
Collaborator

I think tileSizes.emplace_back({{..}, {..}}) also works?

Contributor Author

True.

Contributor Author

Actually, I can't get the compiler to take this. Somehow it is not able to infer the right type for the std::array when trying to do that.
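
This is expected C++ behavior rather than anything MLIR-specific: emplace_back is a template that perfectly forwards its arguments, and a braced-init-list has no type, so template argument deduction fails; push_back works because the list is converted to the element type at the call site. A self-contained sketch with a simplified, hypothetical stand-in for the struct:

#include <array>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the struct used in the PR.
struct TileWorkgroupSizePair {
  std::array<int64_t, 3> tileSize;
  std::array<int64_t, 3> workgroupSize;
};

int main() {
  std::vector<TileWorkgroupSizePair> tileSizes;
  // OK: push_back takes a TileWorkgroupSizePair, so the braced list is
  // list-initialized into the parameter at the call site.
  tileSizes.push_back({{64, 64, 16}, {64, 2, 1}});
  // Does not compile: a braced-init-list cannot be deduced as the template
  // argument of emplace_back.
  // tileSizes.emplace_back({{64, 64, 16}, {64, 2, 1}});
  return 0;
}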

getTensorCoreConfig(TCtileSizeConfig);
// Pick the best configuration where the original shape is aligned on the
// tile size.
for (TileWorkgroupSizePair &config : TCtileSizeConfig) {
Collaborator

Nit: This is why I dislike side data structures that just carry information to populate another data structure. It adds cognitive overhead to reading the code.

Contributor Author

The goal is to separate the configuration data from the logic. I don't really have a better solution at the moment. Do you have any suggestions?

Collaborator

I left a comment above.

    : public OpRewritePattern<vector::BroadcastOp> {
  using OpRewritePattern<vector::BroadcastOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(vector::BroadcastOp op,
Collaborator

Why is this in IREE?

Contributor Author

We have a bit of an ordering issue upstream. After my changes to vectorization we lower transfer op permutation maps early, but some other transformations like vector-to-MMA need the broadcast/transpose to be merged into the transfer op to work. This is something I need to figure out, but since I haven't decided yet what the ideal solution is, I have this in IREE for now.

pm.addPass(createCSEPass());

// Distribute linalg onto warps within the workgroup.
pm.addNestedPass<FuncOp>(
Collaborator

Maybe this pass needs to be split into three.

  1. The current tile + distribute without lowering to loops
  2. The warp distribution
  3. The lowering to loops

Contributor Author

I do think it should be split. It should be one part reduction tiling + promotion, then tile and distribute to warps or threads.
Not sure what you mean by lowering to loops; we don't really have any lowering to loops in this path?
I was considering doing this change here, but the PR is already big.

Collaborator

I was using "lowering to loops" as a catch-all for lowering to scalar + vector code, basically the last step before the NVVM translation.
Fine to do it as a next step.

Collaborator (@MaheshRavishankar) left a comment

Functionally this is fine. I would really recommend some additional clean-up PRs before things get more involved. I left a few comments in place; the biggest thing to clean up is to avoid having duplicate data structures. The goal is that you materialize everything in the IR when you need to, and then just query the IR. Having separate channels just makes things more complicated.

@ThomasRaoux merged commit b793d91 into iree-org:main Nov 17, 2021
@KoolJBlack mentioned this pull request Nov 17, 2021