[AMDGPU] Move kernarg preload logic to separate pass #130434

Open

kerbowa wants to merge 1 commit into main from users/kerbowa/preload-kernarg-pass

Conversation

@kerbowa (Member) commented Mar 8, 2025

Moves kernarg preload logic to its own module pass. Cloned function
declarations are removed when preloading hidden arguments. The inreg
attribute is now added in this pass instead of in AMDGPUAttributor. The
rest of the logic is copied from AMDGPULowerKernelArguments, which now
only checks whether an argument is marked inreg, to avoid replacing
direct uses of preloaded arguments. This change requires test updates to
remove inreg from lit tests with kernels that don't actually want
preloading.
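
For context, the inreg hint logic moving out of AMDGPUAttributor is small. The sketch below condenses the removed addPreloadKernArgHint (visible in the diff further down) to show the marking the new pass takes over; the helper name is illustrative and the surrounding includes/namespace are assumed, not verbatim new-pass code:

// Condensed sketch (assumes LLVM's AMDGPU headers and the llvm namespace).
static void markPreloadHints(Function &F, const GCNSubtarget &ST,
                             unsigned PreloadCount) {
  // Cap the number of marked arguments by the user SGPRs the subtarget
  // can dedicate to preloading.
  unsigned Max = std::min(PreloadCount, ST.getMaxNumUserSGPRs());
  for (unsigned I = 0; I < F.arg_size() && I < Max; ++I) {
    Argument &Arg = *F.getArg(I);
    // byref/nest are incompatible with preloading; stop at the first one,
    // since preloaded arguments must be sequential.
    if (Arg.hasByRefAttr() || Arg.hasNestAttr())
      break;
    Arg.addAttr(Attribute::InReg);
  }
}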

@kerbowa (Member, Author) commented Mar 8, 2025

This stack of pull requests is managed by Graphite.

@kerbowa kerbowa requested review from arsenm and shiltian March 8, 2025 19:35
@kerbowa kerbowa marked this pull request as ready for review March 8, 2025 19:35
@llvmbot (Member) commented Mar 8, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)

Changes

Moves kernarg preload logic to its own module pass. Cloned function
declarations are removed when preloading hidden arguments. The inreg
attribute is now added in this pass instead of in AMDGPUAttributor. The
rest of the logic is copied from AMDGPULowerKernelArguments, which now
only checks whether an argument is marked inreg, to avoid replacing
direct uses of preloaded arguments. This change requires test updates to
remove inreg from lit tests with kernels that don't actually want
preloading.


Patch is 74.77 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/130434.diff

19 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+15)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (-21)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp (+2-254)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp (+358)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd-with-ret.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+6-27)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs-IR-lowering.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs-debug-info.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs-IR-lowering.ll (+13-31)
  • (removed) llvm/test/CodeGen/AMDGPU/preload-kernargs-inreg-hints.ll (-263)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs.ll (+3-7)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+4-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 57297288eecb4..4c26f148310e2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -65,6 +65,7 @@ ModulePass *createAMDGPULowerBufferFatPointersPass();
 FunctionPass *createSIModeRegisterPass();
 FunctionPass *createGCNPreRAOptimizationsLegacyPass();
 FunctionPass *createAMDGPUPreloadKernArgPrologLegacyPass();
+ModulePass *createAMDGPUPreloadKernelArgumentsLegacyPass(const TargetMachine *);
 
 struct AMDGPUSimplifyLibCallsPass : PassInfoMixin<AMDGPUSimplifyLibCallsPass> {
   AMDGPUSimplifyLibCallsPass() {}
@@ -234,6 +235,9 @@ extern char &GCNRegPressurePrinterID;
 void initializeAMDGPUPreloadKernArgPrologLegacyPass(PassRegistry &);
 extern char &AMDGPUPreloadKernArgPrologLegacyID;
 
+void initializeAMDGPUPreloadKernelArgumentsLegacyPass(PassRegistry &);
+extern char &AMDGPUPreloadKernelArgumentsLegacyID;
+
 // Passes common to R600 and SI
 FunctionPass *createAMDGPUPromoteAlloca();
 void initializeAMDGPUPromoteAllocaPass(PassRegistry&);
@@ -345,6 +349,17 @@ class AMDGPUAttributorPass : public PassInfoMixin<AMDGPUAttributorPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };
 
+class AMDGPUPreloadKernelArgumentsPass
+    : public PassInfoMixin<AMDGPUPreloadKernelArgumentsPass> {
+  const AMDGPUTargetMachine &TM;
+
+public:
+  explicit AMDGPUPreloadKernelArgumentsPass(const AMDGPUTargetMachine &TM)
+      : TM(TM) {}
+
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 class AMDGPUAnnotateUniformValuesPass
     : public PassInfoMixin<AMDGPUAnnotateUniformValuesPass> {
 public:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index cfff66fa07f98..bbfa88a3fe872 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -28,10 +28,6 @@ void initializeCycleInfoWrapperPassPass(PassRegistry &);
 
 using namespace llvm;
 
-static cl::opt<unsigned> KernargPreloadCount(
-    "amdgpu-kernarg-preload-count",
-    cl::desc("How many kernel arguments to preload onto SGPRs"), cl::init(0));
-
 static cl::opt<unsigned> IndirectCallSpecializationThreshold(
     "amdgpu-indirect-call-specialization-threshold",
     cl::desc(
@@ -1319,21 +1315,6 @@ struct AAAMDGPUNoAGPR
 
 const char AAAMDGPUNoAGPR::ID = 0;
 
-static void addPreloadKernArgHint(Function &F, TargetMachine &TM) {
-  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
-  for (unsigned I = 0;
-       I < F.arg_size() &&
-       I < std::min(KernargPreloadCount.getValue(), ST.getMaxNumUserSGPRs());
-       ++I) {
-    Argument &Arg = *F.getArg(I);
-    // Check for incompatible attributes.
-    if (Arg.hasByRefAttr() || Arg.hasNestAttr())
-      break;
-
-    Arg.addAttr(Attribute::InReg);
-  }
-}
-
 static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
                     AMDGPUAttributorOptions Options) {
   SetVector<Function *> Functions;
@@ -1383,8 +1364,6 @@ static bool runImpl(Module &M, AnalysisGetter &AG, TargetMachine &TM,
     if (!AMDGPU::isEntryFunctionCC(CC)) {
       A.getOrCreateAAFor<AAAMDFlatWorkGroupSize>(IRPosition::function(*F));
       A.getOrCreateAAFor<AAAMDWavesPerEU>(IRPosition::function(*F));
-    } else if (CC == CallingConv::AMDGPU_KERNEL) {
-      addPreloadKernArgHint(*F, TM);
     }
 
     for (auto &I : instructions(F)) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
index 09412d1b0f1cc..d2dd6869f1070 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
@@ -27,231 +27,6 @@ using namespace llvm;
 
 namespace {
 
-class PreloadKernelArgInfo {
-private:
-  Function &F;
-  const GCNSubtarget &ST;
-  unsigned NumFreeUserSGPRs;
-
-  enum HiddenArg : unsigned {
-    HIDDEN_BLOCK_COUNT_X,
-    HIDDEN_BLOCK_COUNT_Y,
-    HIDDEN_BLOCK_COUNT_Z,
-    HIDDEN_GROUP_SIZE_X,
-    HIDDEN_GROUP_SIZE_Y,
-    HIDDEN_GROUP_SIZE_Z,
-    HIDDEN_REMAINDER_X,
-    HIDDEN_REMAINDER_Y,
-    HIDDEN_REMAINDER_Z,
-    END_HIDDEN_ARGS
-  };
-
-  // Stores information about a specific hidden argument.
-  struct HiddenArgInfo {
-    // Offset in bytes from the location in the kernearg segment pointed to by
-    // the implicitarg pointer.
-    uint8_t Offset;
-    // The size of the hidden argument in bytes.
-    uint8_t Size;
-    // The name of the hidden argument in the kernel signature.
-    const char *Name;
-  };
-
-  static constexpr HiddenArgInfo HiddenArgs[END_HIDDEN_ARGS] = {
-      {0, 4, "_hidden_block_count_x"}, {4, 4, "_hidden_block_count_y"},
-      {8, 4, "_hidden_block_count_z"}, {12, 2, "_hidden_group_size_x"},
-      {14, 2, "_hidden_group_size_y"}, {16, 2, "_hidden_group_size_z"},
-      {18, 2, "_hidden_remainder_x"},  {20, 2, "_hidden_remainder_y"},
-      {22, 2, "_hidden_remainder_z"}};
-
-  static HiddenArg getHiddenArgFromOffset(unsigned Offset) {
-    for (unsigned I = 0; I < END_HIDDEN_ARGS; ++I)
-      if (HiddenArgs[I].Offset == Offset)
-        return static_cast<HiddenArg>(I);
-
-    return END_HIDDEN_ARGS;
-  }
-
-  static Type *getHiddenArgType(LLVMContext &Ctx, HiddenArg HA) {
-    if (HA < END_HIDDEN_ARGS)
-      return Type::getIntNTy(Ctx, HiddenArgs[HA].Size * 8);
-
-    llvm_unreachable("Unexpected hidden argument.");
-  }
-
-  static const char *getHiddenArgName(HiddenArg HA) {
-    if (HA < END_HIDDEN_ARGS) {
-      return HiddenArgs[HA].Name;
-    }
-    llvm_unreachable("Unexpected hidden argument.");
-  }
-
-  // Clones the function after adding implicit arguments to the argument list
-  // and returns the new updated function. Preloaded implicit arguments are
-  // added up to and including the last one that will be preloaded, indicated by
-  // LastPreloadIndex. Currently preloading is only performed on the totality of
-  // sequential data from the kernarg segment including implicit (hidden)
-  // arguments. This means that all arguments up to the last preloaded argument
-  // will also be preloaded even if that data is unused.
-  Function *cloneFunctionWithPreloadImplicitArgs(unsigned LastPreloadIndex) {
-    FunctionType *FT = F.getFunctionType();
-    LLVMContext &Ctx = F.getParent()->getContext();
-    SmallVector<Type *, 16> FTypes(FT->param_begin(), FT->param_end());
-    for (unsigned I = 0; I <= LastPreloadIndex; ++I)
-      FTypes.push_back(getHiddenArgType(Ctx, HiddenArg(I)));
-
-    FunctionType *NFT =
-        FunctionType::get(FT->getReturnType(), FTypes, FT->isVarArg());
-    Function *NF =
-        Function::Create(NFT, F.getLinkage(), F.getAddressSpace(), F.getName());
-
-    NF->copyAttributesFrom(&F);
-    NF->copyMetadata(&F, 0);
-    NF->setIsNewDbgInfoFormat(F.IsNewDbgInfoFormat);
-
-    F.getParent()->getFunctionList().insert(F.getIterator(), NF);
-    NF->takeName(&F);
-    NF->splice(NF->begin(), &F);
-
-    Function::arg_iterator NFArg = NF->arg_begin();
-    for (Argument &Arg : F.args()) {
-      Arg.replaceAllUsesWith(&*NFArg);
-      NFArg->takeName(&Arg);
-      ++NFArg;
-    }
-
-    AttrBuilder AB(Ctx);
-    AB.addAttribute(Attribute::InReg);
-    AB.addAttribute("amdgpu-hidden-argument");
-    AttributeList AL = NF->getAttributes();
-    for (unsigned I = 0; I <= LastPreloadIndex; ++I) {
-      AL = AL.addParamAttributes(Ctx, NFArg->getArgNo(), AB);
-      NFArg++->setName(getHiddenArgName(HiddenArg(I)));
-    }
-
-    NF->setAttributes(AL);
-    F.replaceAllUsesWith(NF);
-    F.setCallingConv(CallingConv::C);
-    F.clearMetadata();
-
-    return NF;
-  }
-
-public:
-  PreloadKernelArgInfo(Function &F, const GCNSubtarget &ST) : F(F), ST(ST) {
-    setInitialFreeUserSGPRsCount();
-  }
-
-  // Returns the maximum number of user SGPRs that we have available to preload
-  // arguments.
-  void setInitialFreeUserSGPRsCount() {
-    GCNUserSGPRUsageInfo UserSGPRInfo(F, ST);
-    NumFreeUserSGPRs = UserSGPRInfo.getNumFreeUserSGPRs();
-  }
-
-  bool tryAllocPreloadSGPRs(unsigned AllocSize, uint64_t ArgOffset,
-                            uint64_t LastExplicitArgOffset) {
-    //  Check if this argument may be loaded into the same register as the
-    //  previous argument.
-    if (ArgOffset - LastExplicitArgOffset < 4 &&
-        !isAligned(Align(4), ArgOffset))
-      return true;
-
-    // Pad SGPRs for kernarg alignment.
-    ArgOffset = alignDown(ArgOffset, 4);
-    unsigned Padding = ArgOffset - LastExplicitArgOffset;
-    unsigned PaddingSGPRs = alignTo(Padding, 4) / 4;
-    unsigned NumPreloadSGPRs = alignTo(AllocSize, 4) / 4;
-    if (NumPreloadSGPRs + PaddingSGPRs > NumFreeUserSGPRs)
-      return false;
-
-    NumFreeUserSGPRs -= (NumPreloadSGPRs + PaddingSGPRs);
-    return true;
-  }
-
-  // Try to allocate SGPRs to preload implicit kernel arguments.
-  void tryAllocImplicitArgPreloadSGPRs(uint64_t ImplicitArgsBaseOffset,
-                                       uint64_t LastExplicitArgOffset,
-                                       IRBuilder<> &Builder) {
-    Function *ImplicitArgPtr = Intrinsic::getDeclarationIfExists(
-        F.getParent(), Intrinsic::amdgcn_implicitarg_ptr);
-    if (!ImplicitArgPtr)
-      return;
-
-    const DataLayout &DL = F.getParent()->getDataLayout();
-    // Pair is the load and the load offset.
-    SmallVector<std::pair<LoadInst *, unsigned>, 4> ImplicitArgLoads;
-    for (auto *U : ImplicitArgPtr->users()) {
-      Instruction *CI = dyn_cast<Instruction>(U);
-      if (!CI || CI->getParent()->getParent() != &F)
-        continue;
-
-      for (auto *U : CI->users()) {
-        int64_t Offset = 0;
-        auto *Load = dyn_cast<LoadInst>(U); // Load from ImplicitArgPtr?
-        if (!Load) {
-          if (GetPointerBaseWithConstantOffset(U, Offset, DL) != CI)
-            continue;
-
-          Load = dyn_cast<LoadInst>(*U->user_begin()); // Load from GEP?
-        }
-
-        if (!Load || !Load->isSimple())
-          continue;
-
-        // FIXME: Expand to handle 64-bit implicit args and large merged loads.
-        LLVMContext &Ctx = F.getParent()->getContext();
-        Type *LoadTy = Load->getType();
-        HiddenArg HA = getHiddenArgFromOffset(Offset);
-        if (HA == END_HIDDEN_ARGS || LoadTy != getHiddenArgType(Ctx, HA))
-          continue;
-
-        ImplicitArgLoads.push_back(std::make_pair(Load, Offset));
-      }
-    }
-
-    if (ImplicitArgLoads.empty())
-      return;
-
-    // Allocate loads in order of offset. We need to be sure that the implicit
-    // argument can actually be preloaded.
-    std::sort(ImplicitArgLoads.begin(), ImplicitArgLoads.end(), less_second());
-
-    // If we fail to preload any implicit argument we know we don't have SGPRs
-    // to preload any subsequent ones with larger offsets. Find the first
-    // argument that we cannot preload.
-    auto *PreloadEnd = std::find_if(
-        ImplicitArgLoads.begin(), ImplicitArgLoads.end(),
-        [&](const std::pair<LoadInst *, unsigned> &Load) {
-          unsigned LoadSize = DL.getTypeStoreSize(Load.first->getType());
-          unsigned LoadOffset = Load.second;
-          if (!tryAllocPreloadSGPRs(LoadSize,
-                                    LoadOffset + ImplicitArgsBaseOffset,
-                                    LastExplicitArgOffset))
-            return true;
-
-          LastExplicitArgOffset =
-              ImplicitArgsBaseOffset + LoadOffset + LoadSize;
-          return false;
-        });
-
-    if (PreloadEnd == ImplicitArgLoads.begin())
-      return;
-
-    unsigned LastHiddenArgIndex = getHiddenArgFromOffset(PreloadEnd[-1].second);
-    Function *NF = cloneFunctionWithPreloadImplicitArgs(LastHiddenArgIndex);
-    assert(NF);
-    for (const auto *I = ImplicitArgLoads.begin(); I != PreloadEnd; ++I) {
-      LoadInst *LoadInst = I->first;
-      unsigned LoadOffset = I->second;
-      unsigned HiddenArgIndex = getHiddenArgFromOffset(LoadOffset);
-      unsigned Index = NF->arg_size() - LastHiddenArgIndex + HiddenArgIndex - 1;
-      Argument *Arg = NF->getArg(Index);
-      LoadInst->replaceAllUsesWith(Arg);
-    }
-  }
-};
-
 class AMDGPULowerKernelArguments : public FunctionPass {
 public:
   static char ID;
@@ -311,10 +86,6 @@ static bool lowerKernelArguments(Function &F, const TargetMachine &TM) {
       Attribute::getWithDereferenceableBytes(Ctx, TotalKernArgSize));
 
   uint64_t ExplicitArgOffset = 0;
-  // Preloaded kernel arguments must be sequential.
-  bool InPreloadSequence = true;
-  PreloadKernelArgInfo PreloadInfo(F, ST);
-
   for (Argument &Arg : F.args()) {
     const bool IsByRef = Arg.hasByRefAttr();
     Type *ArgTy = IsByRef ? Arg.getParamByRefType() : Arg.getType();
@@ -325,25 +96,10 @@ static bool lowerKernelArguments(Function &F, const TargetMachine &TM) {
     uint64_t AllocSize = DL.getTypeAllocSize(ArgTy);
 
     uint64_t EltOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + BaseOffset;
-    uint64_t LastExplicitArgOffset = ExplicitArgOffset;
     ExplicitArgOffset = alignTo(ExplicitArgOffset, ABITypeAlign) + AllocSize;
 
-    // Guard against the situation where hidden arguments have already been
-    // lowered and added to the kernel function signiture, i.e. in a situation
-    // where this pass has run twice.
-    if (Arg.hasAttribute("amdgpu-hidden-argument"))
-      break;
-
-    // Try to preload this argument into user SGPRs.
-    if (Arg.hasInRegAttr() && InPreloadSequence && ST.hasKernargPreload() &&
-        !Arg.getType()->isAggregateType())
-      if (PreloadInfo.tryAllocPreloadSGPRs(AllocSize, EltOffset,
-                                           LastExplicitArgOffset))
-        continue;
-
-    InPreloadSequence = false;
-
-    if (Arg.use_empty())
+    // Skip inreg arguments which should be preloaded.
+    if (Arg.use_empty() || Arg.hasInRegAttr())
       continue;
 
     // If this is byval, the loads are already explicit in the function. We just
@@ -483,14 +239,6 @@ static bool lowerKernelArguments(Function &F, const TargetMachine &TM) {
   KernArgSegment->addRetAttr(
       Attribute::getWithAlignment(Ctx, std::max(KernArgBaseAlign, MaxAlign)));
 
-  if (InPreloadSequence) {
-    uint64_t ImplicitArgsBaseOffset =
-        alignTo(ExplicitArgOffset, ST.getAlignmentForImplicitArgPtr()) +
-        BaseOffset;
-    PreloadInfo.tryAllocImplicitArgPreloadSGPRs(ImplicitArgsBaseOffset,
-                                                ExplicitArgOffset, Builder);
-  }
-
   return true;
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 1050855176c04..ca9a4877907c8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -29,6 +29,7 @@ MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-remove-incompatible-functions", AMDGPURemoveIncompatibleFunctionsPass(*this))
 MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-preload-kernel-arguments", AMDGPUPreloadKernelArgumentsPass(*this))
 #undef MODULE_PASS
 
 #ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp
new file mode 100644
index 0000000000000..2b2f649836ebe
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp
@@ -0,0 +1,358 @@
+//===- AMDGPUPreloadKernelArguments.cpp - Preload Kernel Arguments --------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file This pass preloads kernel arguments into user_data SGPRs before kernel
+/// execution begins. The number of registers available for preloading depends
+/// on the number of free user SGPRs, up to the hardware's maximum limit.
+/// Implicit arguments enabled in the kernel descriptor are allocated first,
+/// followed by SGPRs used for preloaded kernel arguments. (Reference:
+/// https://llvm.org/docs/AMDGPUUsage.html#initial-kernel-execution-state)
+/// Additionally, hidden kernel arguments may be preloaded, in which case they
+/// are appended to the kernel signature after explicit arguments. Preloaded
+/// arguments will be marked with `inreg`.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/PassManager.h"
+#include "llvm/IR/Verifier.h"
+#include "llvm/Pass.h"
+
+#define DEBUG_TYPE "amdgpu-preload-kernel-arguments"
+
+using namespace llvm;
+
+static cl::opt<unsigned> KernargPreloadCount(
+    "amdgpu-kernarg-preload-count",
+    cl::desc("How many kernel arguments to preload onto SGPRs"), cl::init(0));
+
+namespace {
+
+class AMDGPUPreloadKernelArgumentsLegacy : public ModulePass {
+  const AMDGPUTargetMachine *TM;
+
+public:
+  static char ID;
+  explicit AMDGPUPreloadKernelArgumentsLegacy(
+      const AMDGPUTargetMachine *TM = nullptr);
+
+  StringRef getPassName() const override {
+    return "AMDGPU Preload Kernel Arguments";
+  }
+
+  bool runOnModule(Module &M) override;
+};
+
+class PreloadKernelArgInfo {
+private:
+  Function &F;
+  const GCNSubtarget &ST;
+  unsigned NumFreeUserSGPRs;
+
+  enum HiddenArg : unsigned {
+    HIDDEN_BLOCK_COUNT_X,
+    HIDDEN_BLOCK_COUNT_Y,
+    HIDDEN_BLOCK_COUNT_Z,
+    HIDDEN_GROUP_SIZE_X,
+    HIDDEN_GROUP_SIZE_Y,
+    HIDDEN_GROUP_SIZE_Z,
+    HIDDEN_REMAINDER_X,
+    HIDDEN_REMAINDER_Y,
+    HIDDEN_REMAINDER_Z,
+    END_HIDDEN_ARGS
+  };
+
+  // Stores information about a specific hidden argument.
+  struct HiddenArgInfo {
+    // Offset in bytes from the location in the kernearg segment pointed to by
+    // the implicitarg pointer.
+    uint8_t Offset;
+    // The size of the hidden argument in bytes.
+    uint8_t Size;
+    // The name of the hidden argument in the kernel signature.
+    const char *Name;
+  };
+
+  static constexpr HiddenArgInfo HiddenArgs[END_HIDDEN_ARGS] = {
+      {0, 4, "_hidden_block_count_x"}, {4, 4, "_hidden_block_count_y"},
+      {8, 4, "_hidden_block_count_z"}, {12, 2, "_hidden_group_size_x"},
+      {14, 2, "_hidden_group_size_y"}, {16, 2, "_hidden_group_size_z"},
+      {18, 2, "_hidden_remainder_x"},  {20, 2, "_hidden_remainder_y"},
+      {22, 2, "_hidden_remainder_z"}};
+
+  static HiddenArg getHiddenArgFromOffset(unsigned Offset) {
+    for (unsigned I = 0; I < END_HIDDEN_ARGS; ++I)
+      if (HiddenArgs[I].Offset == Offset)
+        return static_cast<HiddenArg>(I);
+
+    return END_HIDDEN_ARGS;
+  }
+
+  static Type *getHiddenArgType(LLVMContext &Ctx, HiddenArg HA) {
+    if (HA < END_HIDDEN_ARGS)
+      return Type::getIntNTy(Ctx, HiddenArgs[HA].Size * 8);
+
+    llvm_unreachable("Unexpected hidden argument.");
+  }
+
+  static const char *getHiddenArgName(HiddenArg HA) {
+    if (HA < END_HIDDEN_ARGS) {
+      return HiddenArgs[HA].Name;
+    }
+    llvm_unreachable("Unexpected hidden argument.");
+  }
+
+  // Clones the function after adding implicit arguments to the argument list
+  // and returns the new updated function. Preloaded implicit arguments are
+  // added up to and including the last one that will be preloaded, indicated by
+  // LastPreloadIndex. Currently preloading is only performed on the totality of
+  // sequential data from the kernarg segment including implicit (hidden)
+  // arguments. This means that all arguments up to the last preloaded argument
+  // will also be preloaded even if that data is unused.
+  Function *cloneFunctionWithPreloadImplicitArgs(unsigned LastPreloadIndex) {
+    FunctionType *FT = F.getFunctionType();
+    LLVMContext &Ctx = F.getParent()->getContext();
+    SmallVector<Type *, 16> FTypes(FT->param_begin(), FT->param_end());
+    for (unsigned I = 0; I <= LastPreloadIndex; ++I)
+      FTypes.push_back(getHiddenArgType(Ctx, HiddenArg(I)));
+
+    FunctionType *NFT =
+        FunctionType::get(FT->getReturnType(), FTypes, FT->isVarArg());
+    Function *NF =
+        Functio...
[truncated]

@kerbowa kerbowa force-pushed the users/kerbowa/preload-kernarg-pass branch from 805e24c to a23a8e4 Compare March 10, 2025 04:33
@kerbowa kerbowa requested a review from krzysz00 April 7, 2025 15:23
@kerbowa (Member, Author) commented Apr 7, 2025

If anyone has opinions about the best approach for this, please let me know, because we need to do something. There are three PRs up that achieve the same end, but I don't want to keep rebasing and updating all three. At different points I've seen interest in each approach based on the discussion in the PRs.

  1. [AMDGPU] Move kernarg preload logic to separate pass #130434
  2. [AMDGPU] Move kernarg preload logic to AMDGPU Attributor #123547
  3. [AMDGPU] Make AMDGPULowerKernelArguments a module pass #112790

@arsenm @shiltian @krzysz00

@shiltian (Contributor) commented Apr 7, 2025

I think this is the best way to handle it. Moving it to the attributor isn't acceptable to me, at least based on what we have right now. I'd need to clearly see what benefits we get from running it iteratively, and how it would actually work if we were to make it part of the attributor.

Moves kernarg preload logic to its own module pass. Cloned function
declarations are removed when preloading hidden arguments. The inreg
attribute is now added in this pass instead of in AMDGPUAttributor. The
rest of the logic is copied from AMDGPULowerKernelArguments, which now
only checks whether an argument is marked inreg, to avoid replacing
direct uses of preloaded arguments. This change requires test updates to
remove inreg from lit tests with kernels that don't actually want
preloading.
@kerbowa kerbowa force-pushed the users/kerbowa/preload-kernarg-pass branch from a23a8e4 to 2e0ce8a Compare May 1, 2025 21:59
@kerbowa kerbowa requested a review from shiltian May 2, 2025 03:16
@@ -29,6 +29,7 @@ MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-remove-incompatible-functions", AMDGPURemoveIncompatibleFunctionsPass(*this))
MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
MODULE_PASS("amdgpu-preload-kernel-arguments", AMDGPUPreloadKernelArgumentsPass(*this))
Collaborator commented:

Maintain the alphabetical order of the options here.
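
With the registrations kept sorted, the new entry would land just before amdgpu-printf-runtime-binding; reordering only the lines shown in this hunk:

MODULE_PASS("amdgpu-preload-kernel-arguments", AMDGPUPreloadKernelArgumentsPass(*this))
MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-remove-incompatible-functions", AMDGPURemoveIncompatibleFunctionsPass(*this))
MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())

Once registered, the pass should be reachable by name, e.g. opt -passes=amdgpu-preload-kernel-arguments when targeting AMDGPU.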

}

bool AMDGPUPreloadKernelArgumentsLegacy::runOnModule(Module &M) {
if (skipModule(M) || !TM)
Collaborator commented:

This check is needed for the NPM run method as well.

Collaborator commented:

Realized that skipModule(M) isn't required for NPM; it is handled automatically by StandardInstrumentations. Still, the early return on a missing TM is absent from the NPM run interface.
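
A sketch of what the NPM entry point could look like with that early return made explicit. The real run() body is cut off by the diff truncation, so preloadModuleKernelArguments is a hypothetical shared helper here; also note that the NPM pass as declared in AMDGPU.h stores a TargetMachine reference rather than a pointer, so a null-TM guard only applies if its interface were changed to match the legacy pass:

PreservedAnalyses
AMDGPUPreloadKernelArgumentsPass::run(Module &M, ModuleAnalysisManager &AM) {
  // No skipModule() needed here: StandardInstrumentations handles pass
  // skipping for the new pass manager, per the review note above.
  bool Changed = preloadModuleKernelArguments(M, TM); // hypothetical helper
  return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
}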
