[AMDGPU] Automatic conversion from wave32 to wave64 #137376
base: main
Conversation
@llvm/pr-subscribers-backend-amdgpu Author: None (alex-t) Changes: Small short-lived kernels may become waveslot-limited. To work around the problem, an optimization is proposed to convert such kernels from wave32 to wave64 automatically. These kernels shall conform to a strict set of limitations and satisfy profitability conditions.
Patch is 20.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137376.diff
7 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 4ff761ec19b3c..76ef87ba44913 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -51,6 +51,7 @@ FunctionPass *createSIMemoryLegalizerPass();
FunctionPass *createSIInsertWaitcntsPass();
FunctionPass *createSIPreAllocateWWMRegsLegacyPass();
FunctionPass *createSIFormMemoryClausesLegacyPass();
+FunctionPass *createSIConvertWaveSizeLegacyPass(const TargetMachine *);
FunctionPass *createSIPostRABundlerPass();
FunctionPass *createAMDGPUImageIntrinsicOptimizerPass(const TargetMachine *);
@@ -174,6 +175,9 @@ extern char &SIShrinkInstructionsLegacyID;
void initializeSIFixSGPRCopiesLegacyPass(PassRegistry &);
extern char &SIFixSGPRCopiesLegacyID;
+void initializeSIConvertWaveSizeLegacyPass(PassRegistry &);
+extern char &SIConvertWaveSizeLegacyID;
+
void initializeSIFixVGPRCopiesLegacyPass(PassRegistry &);
extern char &SIFixVGPRCopiesID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 98a1147ef6d66..0cbd3ef8da761 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -67,6 +67,7 @@ FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
AMDGPUUnifyDivergentExitNodesPass())
FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("si-convert-wave-size", SIConvertWaveSizePass(*static_cast<const GCNTargetMachine *>(this)))
#undef FUNCTION_PASS
#ifndef FUNCTION_ANALYSIS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index b6cc5137d711a..5be1640fd3db6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -44,6 +44,7 @@
#include "R600TargetMachine.h"
#include "SIFixSGPRCopies.h"
#include "SIFixVGPRCopies.h"
+#include "SIConvertWaveSize.h"
#include "SIFoldOperands.h"
#include "SIFormMemoryClauses.h"
#include "SILoadStoreOptimizer.h"
@@ -506,6 +507,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeSILowerSGPRSpillsLegacyPass(*PR);
initializeSIFixSGPRCopiesLegacyPass(*PR);
initializeSIFixVGPRCopiesLegacyPass(*PR);
+ initializeSIConvertWaveSizeLegacyPass(*PR);
initializeSIFoldOperandsLegacyPass(*PR);
initializeSIPeepholeSDWALegacyPass(*PR);
initializeSIShrinkInstructionsLegacyPass(*PR);
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 09a3096602fc3..663361face090 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -150,6 +150,7 @@ add_llvm_target(AMDGPUCodeGen
SIAnnotateControlFlow.cpp
SIFixSGPRCopies.cpp
SIFixVGPRCopies.cpp
+ SIConvertWaveSize.cpp
SIFoldOperands.cpp
SIFormMemoryClauses.cpp
SIFrameLowering.cpp
diff --git a/llvm/lib/Target/AMDGPU/SIConvertWaveSize.cpp b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.cpp
new file mode 100644
index 0000000000000..4f5b839000c77
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.cpp
@@ -0,0 +1,321 @@
+//===- SIConvertWaveSize.cpp - Automatically convert wave32 kernels to wave64 -===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 WITH LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+// Small short-lived kernels may become waveslot-limited.
+// To work around the problem an optimization is proposed to convert such
+// kernels from wave32 to wave64 automatically. These kernels shall conform to
+// a strict set of limitations and satisfy profitability conditions.
+//
+// 1. A kernel shall have no function calls as we cannot analyze call stack
+// requirements (nor will it fall into the category of short-lived kernels
+// anyway).
+// 2. A kernel itself shall not be called from a device enqueue call.
+// 3. A kernel shall not attempt to access EXEC or VCC in any user visible
+// way.
+// 4. A kernel must not use readlane/readfirstlane or any cross-lane/DPP
+// operations in general.
+// 5. A kernel shall not read wavefront size or use ballot through
+// intrinsics (a use of pre-defined frontend wave size macro was deemed
+// permissible for now).
+// 6. There shall be no atomic operations of any sort as these may be used
+// for cross-thread communication.
+// 7. There shall be no LDS access as the allocation is usually tied to the
+// workgroup size and we generally cannot extend it. It is also changing
+// occupancy which is tied to the wave size.
+// 8. There shall be no inline asm calls.
+// 9. There shall be no dynamic VGPRs.
+// 10. Starting from GFX11 some instructions (such as WMMA on GFX11+ and
+// transpose loads on GFX12+) work differently (have different operands) in
+// wave32 and wave64. The kernel shall not have intrinsics to invoke such
+// instructions.
+
+#include "SIConvertWaveSize.h"
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/Analysis/ScalarEvolutionExpressions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/InitializePasses.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "si-convert-wave-size"
+
+namespace {
+class SIConvertWaveSize {
+ const TargetMachine *TM;
+ const LoopInfo *LI;
+ ScalarEvolution *SE;
+ TargetTransformInfo *TTI;
+
+ InstructionCost TotalCost = 0;
+
+ static const unsigned MaxLatency = 2000;
+
+ SmallVector<Function *> Callees;
+
+public:
+ SIConvertWaveSize(const TargetMachine *TM, const LoopInfo *LI,
+ ScalarEvolution *SE, TargetTransformInfo *TTI)
+ : TM(TM), LI(LI), SE(SE), TTI(TTI) {}
+
+ bool run(Function &F);
+
+ bool changeWaveSizeAttr(Function *F);
+};
+
+class SIConvertWaveSizeLegacy : public FunctionPass {
+ const TargetMachine *TM;
+
+public:
+ static char ID;
+ SIConvertWaveSizeLegacy(const TargetMachine *TM) : FunctionPass(ID), TM(TM) {}
+ bool runOnFunction(Function &F) override {
+ auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
+ auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
+ auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
+ SIConvertWaveSize Impl(TM, &LI, &SE, &TTI);
+ return Impl.run(F);
+ }
+ StringRef getPassName() const override { return "SI convert wave size"; }
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addRequired<LoopInfoWrapperPass>();
+ AU.addRequired<ScalarEvolutionWrapperPass>();
+ AU.setPreservesAll();
+ FunctionPass::getAnalysisUsage(AU);
+ }
+};
+} // end anonymous namespace
+
+void printFunctionAttributes(const Function &F) {
+ LLVM_DEBUG(dbgs() << "Function: " << F.getName() << "\n");
+ for (const auto &Attr : F.getAttributes()) {
+ LLVM_DEBUG(dbgs() << " Attribute: " << Attr.getAsString() << "\n");
+ }
+}
+
+bool SIConvertWaveSize::run(Function &F) {
+ LLVM_DEBUG(dbgs() << "Running SIConvertWaveSize on function: " << F.getName() << "\n");
+ LLVM_DEBUG(printFunctionAttributes(F));
+
+ const GCNSubtarget &ST = TM->getSubtarget<GCNSubtarget>(F);
+ if (ST.getGeneration() < AMDGPUSubtarget::GFX11)
+ return false;
+
+ // Check if the function is a kernel.
+ if (F.getCallingConv() != CallingConv::AMDGPU_KERNEL)
+ return false;
+
+ // Check if the kernel is wave32
+ if (F.hasFnAttribute("target-features")) {
+ if (!F.getFnAttribute("target-features")
+ .getValueAsString().contains("wavefrontsize32")) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: Kernel is not wave32.\n");
+ return false;
+ }
+ }
+
+ // Check if the function is a device enqueue call.
+ if (F.hasFnAttribute("amdgpu-device-enqueue")) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: Device enqueue call detected.\n");
+ return false;
+ }
+
+ // Check that the trip count is a compile-time constant for all loops in
+ // the kernel.
+ for (Loop *L : *LI) {
+ const SCEV *TripCountSCEV = SE->getBackedgeTakenCount(L);
+ if (!isa<SCEVConstant>(TripCountSCEV)) {
+ LLVM_DEBUG(
+ dbgs() << "SIConvertWaveSize: Trip count is not a compile time "
+ "constant.\n");
+ return false;
+ }
+ }
+
+ for (const auto &BB : F) {
+ InstructionCost BlockCost = 0;
+ for (const auto &I : BB) {
+ if (const CallBase *CB = dyn_cast<CallBase>(&I)) {
+ // FIXME: No calls are allowed. Only non-convergent intrinsic calls
+ // and amdgcn_s_barrier are exempt. InlineAsm and atomics are checked
+ // separately for debug purposes. This will be changed in the final
+ // version.
+ if (CB->isInlineAsm()) {
+ // Inline assembly is not allowed.
+ LLVM_DEBUG(dbgs()
+ << "SIConvertWaveSize: Inline assembly detected.\n");
+ return false;
+ }
+ if (CB->isAtomic()) {
+ // Atomic operations are not allowed.
+ LLVM_DEBUG(dbgs()
+ << "SIConvertWaveSize: Atomic operation detected.\n");
+ return false;
+ }
+ if (Function *Callee = CB->getCalledFunction()) {
+ // assuming readlane/readfirstlane or any cross-lane/DPP
+ // operations have "let isConvergent = 1" in IntrinsicsAMDGPU.td
+ if (Callee->isIntrinsic()) {
+ if (Callee->hasFnAttribute(Attribute::Convergent)) {
+ if (Callee->getIntrinsicID() != Intrinsic::amdgcn_s_barrier) {
+ // TODO: what else should go in a "white list" ?
+ // Intrinsic::amdgcn_s_barrier_wavefront ?
+ // Intrinsic::amdgcn_s_barrier_signal ?
+ LLVM_DEBUG(dbgs()
+ << "SIConvertWaveSize: Convergent intrinsic "
+ << Callee->getName() << " detected.\n");
+ return false;
+ }
+ }
+
+ if (Callee->getIntrinsicID() == Intrinsic::read_register) {
+ if (const auto *MDVal =
+ dyn_cast<MetadataAsValue>(CB->getArgOperand(0))) {
+ Metadata *MD = MDVal->getMetadata();
+ if (auto *MDNodeVal = dyn_cast<MDNode>(MD)) {
+ if (MDNodeVal->getNumOperands() >= 1) {
+ if (auto *MDStr =
+ dyn_cast<MDString>(MDNodeVal->getOperand(0))) {
+ if (MDStr->getString().starts_with("exec") ||
+ MDStr->getString().starts_with("vcc")) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: read_register("
+ << MDStr->getString()
+ << ") intrinsic detected.\n");
+ return false;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ // Save callee as a candidate for attribute change
+ Callees.push_back(Callee);
+ }
+ } else {
+ // General calls are not allowed.
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: function call detected.\n");
+ return false;
+ }
+ }
+ // No LDS access is allowed
+ if (auto LI = dyn_cast<LoadInst>(&I)) {
+ if (LI->getPointerAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: LDS access detected.\n");
+ return false;
+ }
+ }
+ if (auto SI = dyn_cast<StoreInst>(&I)) {
+ if (SI->getPointerAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: LDS access detected.\n");
+ return false;
+ }
+ }
+ // TODO: Are all atomics disallowed?
+ // if (auto AI = dyn_cast<AtomicRMWInst>(&I)) {
+ // if (AI->getPointerAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ // LLVM_DEBUG(dbgs() << "SIConvertWaveSize: LDS access
+ // detected.\n"); return false;
+ // }
+ // }
+
+ // TODO: Dynamic VGPRs and GFX11+ special operations ???
+ BlockCost +=
+ TTI->getInstructionCost(&I, TargetTransformInfo::TCK_RecipThroughput);
+ }
+ if (auto L = LI->getLoopFor(&BB)) {
+ const SCEV *TripCount = SE->getBackedgeTakenCount(L);
+ if (auto *C = dyn_cast<SCEVConstant>(TripCount)) {
+ uint64_t TC = C->getValue()->getZExtValue() + 1;
+ size_t Depth = LI->getLoopDepth(&BB);
+ BlockCost *= TC * Depth;
+ } else
+ llvm_unreachable("SIConvertWaveSize: only loops with compile time "
+ "constant trip count could reach here!\n");
+ }
+ TotalCost += BlockCost;
+ if (TotalCost.isValid()) {
+ if (TotalCost.getValue().value() >= MaxLatency) {
+ LLVM_DEBUG(
+ dbgs() << "SIConvertWaveSize: Total latency of the kernel ["
+ << TotalCost.getValue().value()
+ << "] exceeds the limit of 2000 cycles - not profitable!\n");
+ return false;
+ }
+ } else
+ llvm_unreachable(
+ "SIConvertWaveSize: Cost model error - invalid state!\n");
+ }
+
+ // Additional checks can be added here...
+
+ // If all checks pass, convert wave size from wave32 to wave64.
+ // Conversion logic goes here...
+ bool Changed = changeWaveSizeAttr(&F);
+ if (Changed)
+ // Now take care of the intrinsic calls
+ for (auto C : Callees) {
+ // TODO: if we could not change the attribute for one of the callees
+ // we need to roll back all the changes!
+ changeWaveSizeAttr(C);
+ }
+
+ return Changed;
+ }
+
+bool SIConvertWaveSize::changeWaveSizeAttr(Function *F) {
+ auto Attr = F->getFnAttribute("target-features");
+ if (Attr.isValid()) {
+ StringRef AttrStr = Attr.getValueAsString();
+ size_t Pos = AttrStr.find("+wavefrontsize32");
+ if (Pos != StringRef::npos) {
+ // Remove the "+wavefrontsize32" attribute.
+ std::string NewBegin = AttrStr.substr(0, Pos).str().append("+wavefrontsize64");
+ std::string End = AttrStr.substr(Pos + strlen("+wavefrontsize32")).str();
+ std::string NewAttrStr = NewBegin + End;
+ // Add the "+wavefrontsize64" attribute.
+ F->removeFnAttr("target-features");
+ F->addFnAttr("target-features", NewAttrStr);
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: Converted wave size for "
+ << F->getName()
+ << " from wave32 "
+ "to wave64.\n");
+ return true;
+ }
+ }
+ return false;
+}
+
+INITIALIZE_PASS_BEGIN(SIConvertWaveSizeLegacy, DEBUG_TYPE, "SI convert wave size",
+ false, false)
+INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
+INITIALIZE_PASS_END(SIConvertWaveSizeLegacy, DEBUG_TYPE, "SI convert wave size",
+ false, false)
+
+char SIConvertWaveSizeLegacy::ID = 0;
+
+char &llvm::SIConvertWaveSizeLegacyID = SIConvertWaveSizeLegacy::ID;
+
+FunctionPass *llvm::createSIConvertWaveSizeLegacyPass(const TargetMachine *TM) {
+ return new SIConvertWaveSizeLegacy(TM);
+}
+
+PreservedAnalyses SIConvertWaveSizePass::run(
+ Function &F, FunctionAnalysisManager &FAM) {
+ auto &LI = FAM.getResult<LoopAnalysis>(F);
+ auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
+ auto &TTI = FAM.getResult<TargetIRAnalysis>(F);
+
+ SIConvertWaveSize Impl(TM, &LI, &SE, &TTI);
+ bool Changed = Impl.run(F);
+ return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/SIConvertWaveSize.h b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.h
new file mode 100644
index 0000000000000..78b8365ed9ebc
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.h
@@ -0,0 +1,30 @@
+//===- SIConvertWaveSize.h --------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+#ifndef LLVM_LIB_TARGET_AMDGPU_SICONVERTWAVESIZE_H
+#define LLVM_LIB_TARGET_AMDGPU_SICONVERTWAVESIZE_H
+
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/Analysis/ScalarEvolution.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+namespace llvm {
+
+class SIConvertWaveSizePass : public PassInfoMixin<SIConvertWaveSizePass> {
+ /// The target machine.
+ const TargetMachine *TM;
+
+public:
+ SIConvertWaveSizePass(const TargetMachine &TM)
+ : TM(&TM) {}
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_SICONVERTWAVESIZE_H
diff --git a/llvm/test/CodeGen/AMDGPU/wave32-to-64-auto-convert.ll b/llvm/test/CodeGen/AMDGPU/wave32-to-64-auto-convert.ll
new file mode 100644
index 0000000000000..d90e524e9cc2e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/wave32-to-64-auto-convert.ll
@@ -0,0 +1,121 @@
+; RUN: opt -S -mtriple=amdgcn -mcpu=gfx1100 -passes=si-convert-wave-size < %s | FileCheck %s
+
+define amdgpu_kernel void @test_not_wave32(ptr addrspace(1) %out) #0 {
+ ; CHECK: @test_not_wave32{{.*}}) #0
+ %gep = getelementptr i32, ptr addrspace(1) %out, i32 2
+ %tmp = load i32, ptr addrspace(1) %gep
+ store i32 %tmp, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @intr_non_convergent(ptr addrspace(1) nocapture %arg) #1 {
+ ; CHECK: @intr_non_convergent{{.*}} #0
+bb:
+ %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
+ %tmp1 = icmp ugt i32 %tmp, 32
+ %tmp2 = select i1 %tmp1, i32 2, i32 1
+ store i32 %tmp2, ptr addrspace(1) %arg
+ ret void
+}
+
+define amdgpu_kernel void @intr_convergent(ptr addrspace(1) nocapture %arg, i32 %X) #1 {
+ ; CHECK: @intr_convergent{{.*}}) #1
+bb:
+ %tmp = icmp ugt i32 %X, 32
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %tmp)
+ store i32 %ballot, ptr addrspace(1) %arg
+ ret void
+}
+
+define amdgpu_kernel void @test_barrier(ptr addrspace(1) %in, ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_barrier{{.*}}) #0
+entry:
+ %val = load <2 x half>, ptr addrspace(1) %in
+ call void @llvm.amdgcn.s.barrier() #2
+ store <2 x half> %val, ptr addrspace(1) %out
+ ret void
+}
+
+
+define amdgpu_kernel void @test_read_exec(ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_read_exec{{.*}}) #1
+ %exec = call i64 @llvm.read_register.i64(metadata !0)
+ store i64 %exec, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_read_vcc_lo(ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_read_vcc_lo{{.*}}) #1
+ %vcc_lo = call i32 @llvm.read_register.i32(metadata !1)
+ store i32 %vcc_lo, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_read_vcc_hi(ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_read_vcc_hi{{.*}}) #1
+ %vcc_hi = call i32 @llvm.read_register.i32(metadata !2)
+ store i32 %vcc_hi, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_lds_access(ptr addrspace(3) %out) #1 {
+ ; CHECK: @test_lds_access{{.*}}) #1
+ %gep = getelementptr i32, ptr addrspace(3) %out, i32 2
+ %tmp = load i32, ptr addrspace(3) %gep
+ store i32 %tmp, ptr addrspace(3) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_simple_loop(ptr addrspace(1) nocapture %arg) #1 {
+ ; CHECK: @test_simple_loop{{.*}}) #1
+bb:
+ br label %bb2
+
+bb1:
+ ret void
+
+bb2:
+ %tmp1 = phi i32 [ 0, %bb ], [ %tmp2, %bb2 ]
+ %tmp2 = add nuw nsw i32 %tmp1, 1
+ %tmp3 = icmp eq i32 %tmp2, 1024
+ tail call void @llvm.amdgcn.s.sleep(i32 0)
+ br i1 %tmp3, label %bb1, label %bb2
+}
+
+define amdgpu_kernel void @test_nested_loop(ptr addrspace(1) nocapture %arg) #1 {
+ ; CHECK: @test_nested_loop{{.*}}) #1
+bb:
+ br label %bb2
+
+bb1:
+ ret void
+
+bb2:
+ %tmp1 = phi i32 [ 0, %bb ], [ %tmp2, %bb4 ]
+ %tmp2 = add nuw nsw i32 %tmp1, 1
+ %tmp3 = icmp eq i32 %tmp2, 8
+ br label %bb3
+
+bb3:
+ %tmp4 = phi i32 [ 0, %bb2 ], [ %tmp5, %bb3 ]
+ %tmp5 = add nuw nsw i32 %tmp4, 1
+ %tmp6 = icmp eq i32 %tmp5, 128
+ tail call void @llv...
[truncated]
This is an early draft, created as a discussion board to work out the proper approach and solutions.
You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h llvm/lib/Target/AMDGPU/AMDGPU.h llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
View the diff from clang-format here:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp
index c166def57..df799f1d4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp
@@ -77,7 +77,8 @@ class AMDGPUConvertWaveSizeLegacy : public FunctionPass {
public:
static char ID;
- AMDGPUConvertWaveSizeLegacy(const GCNTargetMachine *TM) : FunctionPass(ID), TM(TM) {}
+ AMDGPUConvertWaveSizeLegacy(const GCNTargetMachine *TM)
+ : FunctionPass(ID), TM(TM) {}
bool runOnFunction(Function &F) override {
auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
@@ -181,84 +182,83 @@ bool AMDGPUConvertWaveSize::run(Function &F) {
// assuming readlane/readfirstlane or any cross-lane/DPP
// operations have "let isConvergent = 1" in IntrinsicsAMDGPU.td
if (Callee->isIntrinsic()) {
- if (Callee->hasFnAttribute(Attribute::Convergent)) {
- if (Callee->getIntrinsicID() != Intrinsic::amdgcn_s_barrier) {
- // TODO: what else should go in a "white list" ?
- // Intrinsic::amdgcn_s_barrier_wavefront ?
- // Intrinsic::amdgcn_s_barrier_signal ?
- LLVM_DEBUG(dbgs()
- << "AMDGPUConvertWaveSize: Convergent intrinsic "
- << Callee->getName() << " detected.\n");
- return false;
- }
- }
-
- if (Callee->getIntrinsicID() == Intrinsic::read_register ||
- Callee->getIntrinsicID() == Intrinsic::write_register) {
-
+ if (Callee->hasFnAttribute(Attribute::Convergent)) {
+ if (Callee->getIntrinsicID() != Intrinsic::amdgcn_s_barrier) {
+ // TODO: what else should go in a "white list" ?
+ // Intrinsic::amdgcn_s_barrier_wavefront ?
+ // Intrinsic::amdgcn_s_barrier_signal ?
LLVM_DEBUG(dbgs()
- << "AMDGPUConvertWaveSize: read/write_register "
- "intrinsic detected.\n");
+ << "AMDGPUConvertWaveSize: Convergent intrinsic "
+ << Callee->getName() << " detected.\n");
return false;
}
-
- // Take care of LDS access
- if (const auto *MTI = dyn_cast<MemTransferInst>(&I)) {
- auto DstAS = MTI->getDestAddressSpace();
- auto SrcAS = MTI->getSourceAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
- SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memcpy/memmove) detected.\n");
- return false;
- }
- } else if (const auto *MSI = dyn_cast<MemSetInst>(&I)) {
- auto DstAS = MSI->getDestAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memset) detected.\n");
- return false;
- }
- } else if (const auto AMCI = dyn_cast<AtomicMemCpyInst>(&I)) {
- auto DstAS = AMCI->getDestAddressSpace();
- auto SrcAS = AMCI->getSourceAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
- SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(
- dbgs()
- << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memcpy.element.unordered.atomic) detected\n");
- return false;
- }
- } else
- if (const auto AMMI = dyn_cast<AtomicMemMoveInst>(&I)) {
- auto DstAS = AMMI->getDestAddressSpace();
- auto SrcAS = AMMI->getSourceAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
- SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(
- dbgs()
- << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memmove.element.unordered.atomic) detected.\n");
- return false;
- }
- } else if (const auto *AMSI = dyn_cast<AtomicMemSetInst>(&I)) {
- auto DstAS = AMSI->getDestAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(
- dbgs()
- << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memset.element.unordered.atomic) detected.\n");
- return false;
- }
+ }
+
+ if (Callee->getIntrinsicID() == Intrinsic::read_register ||
+ Callee->getIntrinsicID() == Intrinsic::write_register) {
+
+ LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: read/write_register "
+ "intrinsic detected.\n");
+ return false;
+ }
+
+ // Take care of LDS access
+ if (const auto *MTI = dyn_cast<MemTransferInst>(&I)) {
+ auto DstAS = MTI->getDestAddressSpace();
+ auto SrcAS = MTI->getSourceAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
+ SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memcpy/memmove) detected.\n");
+ return false;
+ }
+ } else if (const auto *MSI = dyn_cast<MemSetInst>(&I)) {
+ auto DstAS = MSI->getDestAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memset) detected.\n");
+ return false;
}
+ } else if (const auto AMCI = dyn_cast<AtomicMemCpyInst>(&I)) {
+ auto DstAS = AMCI->getDestAddressSpace();
+ auto SrcAS = AMCI->getSourceAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
+ SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(
+ dbgs()
+ << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memcpy.element.unordered.atomic) detected\n");
+ return false;
+ }
+ } else if (const auto AMMI = dyn_cast<AtomicMemMoveInst>(&I)) {
+ auto DstAS = AMMI->getDestAddressSpace();
+ auto SrcAS = AMMI->getSourceAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
+ SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(
+ dbgs()
+ << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memmove.element.unordered.atomic) detected.\n");
+ return false;
+ }
+ } else if (const auto *AMSI = dyn_cast<AtomicMemSetInst>(&I)) {
+ auto DstAS = AMSI->getDestAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(
+ dbgs()
+ << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memset.element.unordered.atomic) detected.\n");
+ return false;
+ }
+ }
// Save callee as a candidate for attribute change
Callees.push_back(Callee);
}
} else {
// General calls are not allowed.
- LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: function call detected.\n");
+ LLVM_DEBUG(dbgs()
+ << "AMDGPUConvertWaveSize: function call detected.\n");
return false;
}
}
@@ -309,9 +309,9 @@ bool AMDGPUConvertWaveSize::run(Function &F) {
if (const auto MemIntr = dyn_cast<MemIntrinsic>(&I))
- // TODO: Dynamic VGPRS and GFX11+ special operations ???
- BlockCost +=
- TTI->getInstructionCost(&I, TargetTransformInfo::TCK_Latency);
+ // TODO: Dynamic VGPRS and GFX11+ special operations ???
+ BlockCost +=
+ TTI->getInstructionCost(&I, TargetTransformInfo::TCK_Latency);
}
if (auto L = LI->getLoopFor(&BB)) {
const SCEV *TripCount = SE->getBackedgeTakenCount(L);
@@ -360,7 +360,8 @@ bool AMDGPUConvertWaveSize::changeWaveSizeAttr(Function *F) {
size_t Pos = AttrStr.find("+wavefrontsize32");
if (Pos != StringRef::npos) {
// Remove the "+wavefrontsize32" attribute.
- std::string NewBegin = AttrStr.substr(0, Pos).str().append("+wavefrontsize64");
+ std::string NewBegin =
+ AttrStr.substr(0, Pos).str().append("+wavefrontsize64");
std::string End = AttrStr.substr(Pos + strlen("+wavefrontsize32")).str();
std::string NewAttrStr = NewBegin + End;
// Add the "+wavefrontsize64" attribute.
@@ -376,23 +377,24 @@ bool AMDGPUConvertWaveSize::changeWaveSizeAttr(Function *F) {
return false;
}
-INITIALIZE_PASS_BEGIN(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE, "AMDGPU convert wave size",
- false, false)
+INITIALIZE_PASS_BEGIN(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE,
+ "AMDGPU convert wave size", false, false)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
-INITIALIZE_PASS_END(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE, "AMDGPU convert wave size",
- false, false)
+INITIALIZE_PASS_END(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE,
+ "AMDGPU convert wave size", false, false)
char AMDGPUConvertWaveSizeLegacy::ID = 0;
char &llvm::AMDGPUConvertWaveSizeLegacyID = AMDGPUConvertWaveSizeLegacy::ID;
-FunctionPass *llvm::createAMDGPUConvertWaveSizeLegacyPass(const GCNTargetMachine *TM) {
+FunctionPass *
+llvm::createAMDGPUConvertWaveSizeLegacyPass(const GCNTargetMachine *TM) {
return new AMDGPUConvertWaveSizeLegacy(TM);
}
-PreservedAnalyses AMDGPUConvertWaveSizePass::run(
- Function &F, FunctionAnalysisManager &FAM) {
+PreservedAnalyses AMDGPUConvertWaveSizePass::run(Function &F,
+ FunctionAnalysisManager &FAM) {
auto &LI = FAM.getResult<LoopAnalysis>(F);
auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
auto &TTI = FAM.getResult<TargetIRAnalysis>(F);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h
index e5b8c92c0..88152369e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h
@@ -1,4 +1,5 @@
-//===- SIConvertWaveSize.h ----------------------------------------*- C++- *-===//
+//===- SIConvertWaveSize.h ----------------------------------------*- C++-
+//*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
@@ -16,13 +17,13 @@
namespace llvm {
-class AMDGPUConvertWaveSizePass : public PassInfoMixin<AMDGPUConvertWaveSizePass> {
+class AMDGPUConvertWaveSizePass
+ : public PassInfoMixin<AMDGPUConvertWaveSizePass> {
/// The target machine.
const GCNTargetMachine *TM;
public:
- AMDGPUConvertWaveSizePass(const GCNTargetMachine &TM)
- : TM(&TM) {}
+ AMDGPUConvertWaveSizePass(const GCNTargetMachine &TM) : TM(&TM) {}
PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
};
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 6b214beb4..3dee8d09b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -17,6 +17,7 @@
#include "AMDGPUTargetMachine.h"
#include "AMDGPU.h"
#include "AMDGPUAliasAnalysis.h"
+#include "AMDGPUConvertWaveSize.h"
#include "AMDGPUCtorDtorLowering.h"
#include "AMDGPUExportClustering.h"
#include "AMDGPUExportKernelRuntimeHandles.h"
@@ -44,7 +45,6 @@
#include "R600TargetMachine.h"
#include "SIFixSGPRCopies.h"
#include "SIFixVGPRCopies.h"
-#include "AMDGPUConvertWaveSize.h"
#include "SIFoldOperands.h"
#include "SIFormMemoryClauses.h"
#include "SILoadStoreOptimizer.h"
A few questions to discuss right now:
You always know the wave size as long as you have the subtarget. I assume the attribute shall always be set by the FE.
I am not sure what you mean, but it sounds like you are referring to intrinsics accessing the wave size, since those need such mangling. That shall not be allowed.
// Check if the function is a device enqueue call.
if (F.hasFnAttribute("amdgpu-device-enqueue")) {
I did not find such an attribute in our sources.
Just forgot to comment it out. It does not exist yet. The placeholder was added to mark an idea to discuss: could we make the FE set such an attribute based on the presence of @llvm.amdgcn.dispatch.ptr, @llvm.amdgcn.implicitarg.ptr, and other relevant attributes?
I believe the kernel simply shall have no references to its symbol, i.e. its address shall not escape.
No, we can only do negative usage detection in AMDGPUAttributor. Attributes that specify that a feature is positively used are broken; we don't have any of those anymore.
// No LDS access is allowed
if (auto LI = dyn_cast<LoadInst>(&I)) {
A load or store to a local address is not a sufficient check; LDS may be addressed through a flat pointer. The kernel shall have no references to LDS objects.
I have a few questions...
- We can pass a flat pointer as an argument.
- We cannot check that it does not point to LDS.
- The "noalias" attribute does not help.
- AMDGPU::addrspacesMayAlias states that flat may alias anything.
- On the other hand, it would be an ABI violation if a kernel argument aliased an object in LDS.
- We can have a non-pointer (i32) argument and inttoptr to addrspace(3) - trivial to check (see the sketch after this list).
- addrspacecast to addrspace(3) - trivial to check.
- Any arbitrary integer expression cast to a flat pointer - we cannot check whether it points to LDS (probably a corner case).
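A minimal sketch of the two "trivial to check" cases from the list above. The helper name producesLDSPointer is hypothetical and this is illustrative only, assuming the usual AMDGPU headers are available:

static bool producesLDSPointer(const Instruction &I) {
  // addrspacecast to addrspace(3) is trivial to detect.
  if (const auto *ASC = dyn_cast<AddrSpaceCastInst>(&I))
    return ASC->getDestAddressSpace() == AMDGPUAS::LOCAL_ADDRESS;
  // So is an inttoptr that produces an addrspace(3) pointer.
  if (const auto *ITP = dyn_cast<IntToPtrInst>(&I))
    return ITP->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS;
  return false;
}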
If you're restricting this to kernels, it's easier to check that there is no allocation on the kernel (e.g. check amdgpu-lds-size; one would need to double-check how this interacts with the dynamic LDS case).
You cannot pass a pointer to LDS into a kernel. There is no way to allocate it. But dynamic LDS shall be disallowed.
But a kernel can have an LDS allocation as a kernel argument, as that is how it works in OpenCL. It has no content.
Right, that's what I mean. The transformation shall be disabled if we have dynamic LDS.
The current change regarding the LDS access check is a draft aiming to figure out whether this looks like the right approach.
The current change regarding the LDS access check is a draft aiming to figure out whether this looks like the right approach.
You do not need to check all possible memory instructions; just check whether LDS globals are used anywhere, and also that dynamic LDS is not used.
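A minimal sketch of that suggestion, assuming it is enough to look for used addrspace(3) globals at module scope. The helper name usesLDS is hypothetical, and per-kernel reachability of those uses is deliberately glossed over, so this is conservative:

static bool usesLDS(const Module &M) {
  for (const GlobalVariable &GV : M.globals())
    // LDS objects, including dynamic LDS, live in the local address space.
    if (GV.getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS && !GV.use_empty())
      return true;
  return false;
}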
bool SIConvertWaveSize::changeWaveSizeAttr(Function *F) {
  auto Attr = F->getFnAttribute("target-features");
In addition you need to change AMDGPUSubtarget::WavefrontSizeLog2, so that ST.getWavefrontSize() returns the correct value.
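One way to sanity-check that, assuming the target machine keys its cached subtargets on the function's feature string and so re-creates them after the attribute rewrite; a sketch, not part of the patch:

// After changeWaveSizeAttr(&F) succeeds, a re-queried subtarget should
// reflect the rewritten "target-features" string.
const GCNSubtarget &NewST = TM->getSubtarget<GCNSubtarget>(F);
assert(NewST.getWavefrontSize() == 64 && "subtarget did not pick up wave64");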
@@ -0,0 +1,121 @@
; RUN: opt -S -mtriple=amdgcn -mcpu=gfx1100 -passes=si-convert-wave-size < %s | FileCheck %s |
We also need to test full compilation and check the resulting .amdhsa_wavefront_size32 value.
This would be easier to implement in attributor
// 1. A kernel shall have no function calls as we cannot analyze call stack
// requirements (nor will it fall into the category of short-lived kernels
// anyway).
If you did this in the attributor and could convert the entire call graph, this wouldn't be an issue.
Not needed, though. The cost of one call would make the whole transformation useless.
The call itself isn't necessarily expensive, but the attributor would also take care of most of these queries even within a single kernel
We are talking about quite a low latency threshold.
void printFunctionAttributes(const Function &F) {
  LLVM_DEBUG(dbgs() << "Function: " << F.getName() << "\n");
  for (const auto &Attr : F.getAttributes()) {
    LLVM_DEBUG(dbgs() << " Attribute: " << Attr.getAsString() << "\n");
  }
}
Leftover debug
It was left in on purpose, as this is an early draft. I will remove it once the debugging is finished.
  return Changed;
}

bool SIConvertWaveSize::changeWaveSizeAttr(Function *F) {
You shouldn't be trying to parse or find the position; you can simply append the new wavesize feature at the end.
bb4:
  br i1 %tmp3, label %bb1, label %bb2
}
test atomicrmw + cmpxchg + memory intrinsics
F->removeFnAttr("target-features");
F->addFnAttr("target-features", NewAttrStr);
just doing addFnAttr should override the old value
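Combining the two suggestions (append instead of splicing, and rely on addFnAttr replacing the old string attribute), changeWaveSizeAttr could shrink to something like the sketch below. The helper name setWave64 is hypothetical, and it assumes later entries in "target-features" take precedence over earlier ones:

static void setWave64(Function &F) {
  StringRef Old = F.getFnAttribute("target-features").getValueAsString();
  // Later features override earlier ones, so appending is sufficient.
  std::string New = (Old.empty() ? std::string() : Old.str() + ",") +
                    "-wavefrontsize32,+wavefrontsize64";
  F.addFnAttr("target-features", New); // replaces any previous value
}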
This also probably should be an opt-in feature... I don't think we have any API rules that prevent you from assuming the wave size will match what you compiled for. At the very least we need an explicit opt-out mechanism.
@llvm.read_register.i32(some regular reg) is allowed, as it is not convergent and does not read/write exec or vcc. We would have to use llvm.read_register.i64 instead if we convert the caller kernel to wave64; otherwise it will only read the low half of the register.
You cannot do that and keep the semantics. Just bail on any use of these intrinsics, regardless of the register accessed.
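The updated version of the patch does exactly that (see the clang-format diff above): it bails out on any read_register/write_register without inspecting the register name:

if (Callee->getIntrinsicID() == Intrinsic::read_register ||
    Callee->getIntrinsicID() == Intrinsic::write_register) {
  LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: read/write_register "
                       "intrinsic detected.\n");
  return false;
}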
// Check if the function can be called via device enqueue.
bool addressEscapes = false;
if (!F.use_empty()) {
  const Module *M = F.getParent();
  for (const GlobalVariable &GV : M->globals()) {
    if (GV.hasInitializer()) {
      if (const Constant *Init = GV.getInitializer()) {
        if (isa<Function>(Init) && Init == &F) {
          addressEscapes = true;
        }
      }
    }
  }
}

if (addressEscapes) {
F.hasAddressTaken (but again, this whole thing would be easier to implement in attributor)
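That is, the whole quoted block above collapses to a single query; a sketch:

// Conservatively reject kernels whose address escapes (e.g. is stored into
// a global initializer); this also covers the device-enqueue concern.
if (F.hasAddressTaken()) {
  LLVM_DEBUG(dbgs() << "SIConvertWaveSize: kernel address escapes.\n");
  return false;
}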