[AMDGPU] Automatic conversion from wave32 to wave64 #137376
base: main
Conversation
@llvm/pr-subscribers-backend-amdgpu Author: None (alex-t) Changes: Small short-lived kernels may become waveslot-limited. To work around the problem, an optimization is proposed to convert such kernels from wave32 to wave64 automatically. These kernels shall conform to a strict set of limitations and satisfy profitability conditions.
Patch is 20.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137376.diff
7 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 4ff761ec19b3c..76ef87ba44913 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -51,6 +51,7 @@ FunctionPass *createSIMemoryLegalizerPass();
FunctionPass *createSIInsertWaitcntsPass();
FunctionPass *createSIPreAllocateWWMRegsLegacyPass();
FunctionPass *createSIFormMemoryClausesLegacyPass();
+FunctionPass *createSIConvertWaveSizeLegacyPass(const TargetMachine *);
FunctionPass *createSIPostRABundlerPass();
FunctionPass *createAMDGPUImageIntrinsicOptimizerPass(const TargetMachine *);
@@ -174,6 +175,9 @@ extern char &SIShrinkInstructionsLegacyID;
void initializeSIFixSGPRCopiesLegacyPass(PassRegistry &);
extern char &SIFixSGPRCopiesLegacyID;
+void initializeSIConvertWaveSizeLegacyPass(PassRegistry &);
+extern char &SIConvertWaveSizeLegacyID;
+
void initializeSIFixVGPRCopiesLegacyPass(PassRegistry &);
extern char &SIFixVGPRCopiesID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 98a1147ef6d66..0cbd3ef8da761 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -67,6 +67,7 @@ FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
AMDGPUUnifyDivergentExitNodesPass())
FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("si-convert-wave-size", SIConvertWaveSizePass(*static_cast<const GCNTargetMachine *>(this)))
#undef FUNCTION_PASS
#ifndef FUNCTION_ANALYSIS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index b6cc5137d711a..5be1640fd3db6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -44,6 +44,7 @@
#include "R600TargetMachine.h"
#include "SIFixSGPRCopies.h"
#include "SIFixVGPRCopies.h"
+#include "SIConvertWaveSize.h"
#include "SIFoldOperands.h"
#include "SIFormMemoryClauses.h"
#include "SILoadStoreOptimizer.h"
@@ -506,6 +507,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeSILowerSGPRSpillsLegacyPass(*PR);
initializeSIFixSGPRCopiesLegacyPass(*PR);
initializeSIFixVGPRCopiesLegacyPass(*PR);
+ initializeSIConvertWaveSizeLegacyPass(*PR);
initializeSIFoldOperandsLegacyPass(*PR);
initializeSIPeepholeSDWALegacyPass(*PR);
initializeSIShrinkInstructionsLegacyPass(*PR);
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 09a3096602fc3..663361face090 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -150,6 +150,7 @@ add_llvm_target(AMDGPUCodeGen
SIAnnotateControlFlow.cpp
SIFixSGPRCopies.cpp
SIFixVGPRCopies.cpp
+ SIConvertWaveSize.cpp
SIFoldOperands.cpp
SIFormMemoryClauses.cpp
SIFrameLowering.cpp
diff --git a/llvm/lib/Target/AMDGPU/SIConvertWaveSize.cpp b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.cpp
new file mode 100644
index 0000000000000..4f5b839000c77
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.cpp
@@ -0,0 +1,321 @@
+//===- SIConvertWaveSize.cpp - Automatically convert wave32 kernels to wave64 -===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 WITH LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+// Small short-lived kernels may become waveslot-limited.
+// To work around the problem an optimization is proposed to convert such
+// kernels from wave32 to wave64 automatically. These kernels shall conform to
+// a strict set of limitations and satisfy profitability conditions.
+//
+// 1. A kernel shall have no function calls as we cannot analyze call stack
+// requirements (nor will it fall into the category of short-lived kernels
+// anyway).
+// 2. A kernel itself shall not be called from a device enqueue call.
+// 3. A kernel shall not attempt to access EXEC or VCC in any user visible
+// way.
+// 4. A kernel must not use readlane/readfirstlane or any cross-lane/DPP
+// operations in general.
+// 5. A kernel shall not read wavefront size or use ballot through
+// intrinsics (a use of pre-defined frontend wave size macro was deemed
+// permissible for now).
+// 6. There shall be no atomic operations of any sort as these may be used
+// for cross-thread communication.
+// 7. There shall be no LDS access as the allocation is usually tied to the
+// workgroup size and we generally cannot extend it. It is also changing
+// occupancy which is tied to the wave size.
+// 8. There shall be no inline asm calls.
+// 9. There shall be no dynamic VGPRs.
+// 10. Starting from GFX11 some instructions (such as WMMA on GFX11+ and
+// transpose loads on GFX12+) work differently (have different operands) in
+// wave32 and wave64. The kernel shall not have intrinsics to invoke such
+// instructions.
+
+#include "SIConvertWaveSize.h"
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/Analysis/ScalarEvolutionExpressions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/InitializePasses.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "si-convert-wave-size"
+
+namespace {
+class SIConvertWaveSize {
+ const TargetMachine *TM;
+ const LoopInfo *LI;
+ ScalarEvolution *SE;
+ TargetTransformInfo *TTI;
+
+ InstructionCost TotalCost = 0;
+
+ static const unsigned MaxLatency = 2000;
+
+ SmallVector<Function *> Callees;
+
+public:
+ SIConvertWaveSize(const TargetMachine *TM, const LoopInfo *LI,
+ ScalarEvolution *SE, TargetTransformInfo *TTI)
+ : TM(TM), LI(LI), SE(SE), TTI(TTI) {}
+
+ bool run(Function &F);
+
+ bool changeWaveSizeAttr(Function *F);
+};
+
+class SIConvertWaveSizeLegacy : public FunctionPass {
+ const TargetMachine *TM;
+
+public:
+ static char ID;
+ SIConvertWaveSizeLegacy(const TargetMachine *TM) : FunctionPass(ID), TM(TM) {}
+ bool runOnFunction(Function &F) override {
+ auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
+ auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
+ auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
+ SIConvertWaveSize Impl(TM, &LI, &SE, &TTI);
+ return Impl.run(F);
+ }
+ StringRef getPassName() const override { return "SI convert wave size"; }
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addRequired<LoopInfoWrapperPass>();
+ AU.addRequired<ScalarEvolutionWrapperPass>();
+ AU.setPreservesAll();
+ FunctionPass::getAnalysisUsage(AU);
+ }
+};
+} // end anonymous namespace
+
+void printFunctionAttributes(const Function &F) {
+ LLVM_DEBUG(dbgs() << "Function: " << F.getName() << "\n");
+ for (const auto &Attr : F.getAttributes()) {
+ LLVM_DEBUG(dbgs() << " Attribute: " << Attr.getAsString() << "\n");
+ }
+}
+
+bool SIConvertWaveSize::run(Function &F) {
+ LLVM_DEBUG(dbgs() << "Running SIConvertWaveSize on function: " << F.getName() << "\n");
+ LLVM_DEBUG(printFunctionAttributes(F));
+
+ const GCNSubtarget &ST = TM->getSubtarget<GCNSubtarget>(F);
+ if (ST.getGeneration() < AMDGPUSubtarget::GFX11)
+ return false;
+
+ // Check if the function is a kernel.
+ if (F.getCallingConv() != CallingConv::AMDGPU_KERNEL)
+ return false;
+
+ // Check if the kernel is wave32
+ if (F.hasFnAttribute("target-features")) {
+ if (!F.getFnAttribute("target-features")
+ .getValueAsString().contains("wavefrontsize32")) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: Kernel is not wave32.\n");
+ return false;
+ }
+ }
+
+ // Check if the function is a device enqueue call.
+ if (F.hasFnAttribute("amdgpu-device-enqueue")) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: Device enqueue call detected.\n");
+ return false;
+ }
+
+ // Check that the trip count is a compile-time constant for all loops in
+ // the kernel.
+ for (Loop *L : *LI) {
+ const SCEV *TripCountSCEV = SE->getBackedgeTakenCount(L);
+ if (!isa<SCEVConstant>(TripCountSCEV)) {
+ LLVM_DEBUG(
+ dbgs() << "SIConvertWaveSize: Trip count is not a compile time "
+ "constant.\n");
+ return false;
+ }
+ }
+
+ for (const auto &BB : F) {
+ InstructionCost BlockCost = 0;
+ for (const auto &I : BB) {
+ if (const CallBase *CB = dyn_cast<CallBase>(&I)) {
+ // FIXME: No calls are allowed. Only non-convergent intrinsic calls
+ // and amdgcn_s_barrier are exempt. InlineAsm and atomics are checked
+ // separately for debug purposes. This will be changed in the final
+ // version.
+ if (CB->isInlineAsm()) {
+ // Inline assembly is not allowed.
+ LLVM_DEBUG(dbgs()
+ << "SIConvertWaveSize: Inline assembly detected.\n");
+ return false;
+ }
+ if (CB->isAtomic()) {
+ // Atomic operations are not allowed.
+ LLVM_DEBUG(dbgs()
+ << "SIConvertWaveSize: Atomic operation detected.\n");
+ return false;
+ }
+ if (Function *Callee = CB->getCalledFunction()) {
+ // assuming readlane/readfirstlane or any cross-lane/DPP
+ // operations have "let isConvergent = 1" in IntrinsicsAMDGPU.td
+ if (Callee->isIntrinsic()) {
+ if (Callee->hasFnAttribute(Attribute::Convergent)) {
+ if (Callee->getIntrinsicID() != Intrinsic::amdgcn_s_barrier) {
+ // TODO: what else should go in a "white list" ?
+ // Intrinsic::amdgcn_s_barrier_wavefront ?
+ // Intrinsic::amdgcn_s_barrier_signal ?
+ LLVM_DEBUG(dbgs()
+ << "SIConvertWaveSize: Convergent intrinsic "
+ << Callee->getName() << " detected.\n");
+ return false;
+ }
+ }
+
+ if (Callee->getIntrinsicID() == Intrinsic::read_register) {
+ if (const auto *MDVal =
+ dyn_cast<MetadataAsValue>(CB->getArgOperand(0))) {
+ Metadata *MD = MDVal->getMetadata();
+ if (auto *MDNodeVal = dyn_cast<MDNode>(MD)) {
+ if (MDNodeVal->getNumOperands() >= 1) {
+ if (auto *MDStr =
+ dyn_cast<MDString>(MDNodeVal->getOperand(0))) {
+ if (MDStr->getString().starts_with("exec") ||
+ MDStr->getString().starts_with("vcc")) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: read_register("
+ << MDStr->getString()
+ << ") intrinsic detected.\n");
+ return false;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ // Save callee as a candidate for attribute change
+ Callees.push_back(Callee);
+ }
+ } else {
+ // General calls are not allowed.
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: function call detected.\n");
+ return false;
+ }
+ }
+ // No LDS access is allowed
+ if (auto LI = dyn_cast<LoadInst>(&I)) {
+ if (LI->getPointerAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: LDS access detected.\n");
+ return false;
+ }
+ }
+ if (auto SI = dyn_cast<StoreInst>(&I)) {
+ if (SI->getPointerAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: LDS access detected.\n");
+ return false;
+ }
+ }
+ // TODO: Are all atomics disallowed?
+ // if (auto AI = dyn_cast<AtomicRMWInst>(&I)) {
+ // if (AI->getPointerAddressSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ // LLVM_DEBUG(dbgs() << "SIConvertWaveSize: LDS access
+ // detected.\n"); return false;
+ // }
+ // }
+
+ // TODO: Dynamic VGPRs and GFX11+ special operations ???
+ BlockCost +=
+ TTI->getInstructionCost(&I, TargetTransformInfo::TCK_RecipThroughput);
+ }
+ if (auto L = LI->getLoopFor(&BB)) {
+ const SCEV *TripCount = SE->getBackedgeTakenCount(L);
+ if (auto *C = dyn_cast<SCEVConstant>(TripCount)) {
+ uint64_t TC = C->getValue()->getZExtValue() + 1;
+ size_t Depth = LI->getLoopDepth(&BB);
+ BlockCost *= TC * Depth;
+ } else
+ llvm_unreachable("SIConvertWaveSize: only loops with compile time "
+ "constant trip count could reach here!\n");
+ }
+ TotalCost += BlockCost;
+ if (TotalCost.isValid()) {
+ if (TotalCost.getValue().value() >= MaxLatency) {
+ LLVM_DEBUG(
+ dbgs() << "SIConvertWaveSize: Total latency of the kernel ["
+ << TotalCost.getValue().value()
+ << "] exceeds the limit of 2000 cycles - not profitable!\n");
+ return false;
+ }
+ } else
+ llvm_unreachable(
+ "SIConvertWaveSize: Cost model error - invalid state!\n");
+ }
+
+ // Additional checks can be added here...
+
+ // If all checks pass, convert wave size from wave32 to wave64.
+ // Conversion logic goes here...
+ bool Changed = changeWaveSizeAttr(&F);
+ if (Changed)
+ // Now take care of the intrinsic calls
+ for (auto C : Callees) {
+ // TODO: if we could not change the attribute for one of the callees
+ // we need to roll back all the changes!
+ changeWaveSizeAttr(C);
+ }
+
+ return Changed;
+ }
+
+bool SIConvertWaveSize::changeWaveSizeAttr(Function *F) {
+ auto Attr = F->getFnAttribute("target-features");
+ if (Attr.isValid()) {
+ StringRef AttrStr = Attr.getValueAsString();
+ size_t Pos = AttrStr.find("+wavefrontsize32");
+ if (Pos != StringRef::npos) {
+ // Remove the "+wavefrontsize32" attribute.
+ std::string NewBegin = AttrStr.substr(0, Pos).str().append("+wavefrontsize64");
+ std::string End = AttrStr.substr(Pos + strlen("+wavefrontsize32")).str();
+ std::string NewAttrStr = NewBegin + End;
+ // Add the "+wavefrontsize64" attribute.
+ F->removeFnAttr("target-features");
+ F->addFnAttr("target-features", NewAttrStr);
+ LLVM_DEBUG(dbgs() << "SIConvertWaveSize: Converted wave size for "
+ << F->getName()
+ << " from wave32 "
+ "to wave64.\n");
+ return true;
+ }
+ }
+ return false;
+}
+
+INITIALIZE_PASS_BEGIN(SIConvertWaveSizeLegacy, DEBUG_TYPE, "SI convert wave size",
+ false, false)
+INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
+INITIALIZE_PASS_END(SIConvertWaveSizeLegacy, DEBUG_TYPE, "SI convert wave size",
+ false, false)
+
+char SIConvertWaveSizeLegacy::ID = 0;
+
+char &llvm::SIConvertWaveSizeLegacyID = SIConvertWaveSizeLegacy::ID;
+
+FunctionPass *llvm::createSIConvertWaveSizeLegacyPass(const TargetMachine *TM) {
+ return new SIConvertWaveSizeLegacy(TM);
+}
+
+PreservedAnalyses SIConvertWaveSizePass::run(
+ Function &F, FunctionAnalysisManager &FAM) {
+ auto &LI = FAM.getResult<LoopAnalysis>(F);
+ auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
+ auto &TTI = FAM.getResult<TargetIRAnalysis>(F);
+
+ SIConvertWaveSize Impl(TM, &LI, &SE, &TTI);
+ bool Changed = Impl.run(F);
+ return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/SIConvertWaveSize.h b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.h
new file mode 100644
index 0000000000000..78b8365ed9ebc
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/SIConvertWaveSize.h
@@ -0,0 +1,30 @@
+//===- SIConvertWaveSize.h --------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+#ifndef LLVM_LIB_TARGET_AMDGPU_SICONVERTWAVESIZE_H
+#define LLVM_LIB_TARGET_AMDGPU_SICONVERTWAVESIZE_H
+
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/Analysis/ScalarEvolution.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+namespace llvm {
+
+class SIConvertWaveSizePass : public PassInfoMixin<SIConvertWaveSizePass> {
+ /// The target machine.
+ const TargetMachine *TM;
+
+public:
+ SIConvertWaveSizePass(const TargetMachine &TM)
+ : TM(&TM) {}
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_SICONVERTWAVESIZE_H
diff --git a/llvm/test/CodeGen/AMDGPU/wave32-to-64-auto-convert.ll b/llvm/test/CodeGen/AMDGPU/wave32-to-64-auto-convert.ll
new file mode 100644
index 0000000000000..d90e524e9cc2e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/wave32-to-64-auto-convert.ll
@@ -0,0 +1,121 @@
+; RUN: opt -S -mtriple=amdgcn -mcpu=gfx1100 -passes=si-convert-wave-size < %s | FileCheck %s
+
+define amdgpu_kernel void @test_not_wave32(ptr addrspace(1) %out) #0 {
+ ; CHECK: @test_not_wave32{{.*}}) #0
+ %gep = getelementptr i32, ptr addrspace(1) %out, i32 2
+ %tmp = load i32, ptr addrspace(1) %gep
+ store i32 %tmp, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @intr_non_convergent(ptr addrspace(1) nocapture %arg) #1 {
+ ; CHECK: @intr_non_convergent{{.*}} #0
+bb:
+ %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
+ %tmp1 = icmp ugt i32 %tmp, 32
+ %tmp2 = select i1 %tmp1, i32 2, i32 1
+ store i32 %tmp2, ptr addrspace(1) %arg
+ ret void
+}
+
+define amdgpu_kernel void @intr_convergent(ptr addrspace(1) nocapture %arg, i32 %X) #1 {
+ ; CHECK: @intr_convergent{{.*}}) #1
+bb:
+ %tmp = icmp ugt i32 %X, 32
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %tmp)
+ store i32 %ballot, ptr addrspace(1) %arg
+ ret void
+}
+
+define amdgpu_kernel void @test_barrier(ptr addrspace(1) %in, ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_barrier{{.*}}) #0
+entry:
+ %val = load <2 x half>, ptr addrspace(1) %in
+ call void @llvm.amdgcn.s.barrier() #2
+ store <2 x half> %val, ptr addrspace(1) %out
+ ret void
+}
+
+
+define amdgpu_kernel void @test_read_exec(ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_read_exec{{.*}}) #1
+ %exec = call i64 @llvm.read_register.i64(metadata !0)
+ store i64 %exec, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_read_vcc_lo(ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_read_vcc_lo{{.*}}) #1
+ %vcc_lo = call i32 @llvm.read_register.i32(metadata !1)
+ store i32 %vcc_lo, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_read_vcc_hi(ptr addrspace(1) %out) #1 {
+ ; CHECK: @test_read_vcc_hi{{.*}}) #1
+ %vcc_hi = call i32 @llvm.read_register.i32(metadata !2)
+ store i32 %vcc_hi, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_lds_access(ptr addrspace(3) %out) #1 {
+ ; CHECK: @test_lds_access{{.*}}) #1
+ %gep = getelementptr i32, ptr addrspace(3) %out, i32 2
+ %tmp = load i32, ptr addrspace(3) %gep
+ store i32 %tmp, ptr addrspace(3) %out
+ ret void
+}
+
+define amdgpu_kernel void @test_simple_loop(ptr addrspace(1) nocapture %arg) #1 {
+ ; CHECK: @test_simple_loop{{.*}}) #1
+bb:
+ br label %bb2
+
+bb1:
+ ret void
+
+bb2:
+ %tmp1 = phi i32 [ 0, %bb ], [ %tmp2, %bb2 ]
+ %tmp2 = add nuw nsw i32 %tmp1, 1
+ %tmp3 = icmp eq i32 %tmp2, 1024
+ tail call void @llvm.amdgcn.s.sleep(i32 0)
+ br i1 %tmp3, label %bb1, label %bb2
+}
+
+define amdgpu_kernel void @test_nested_loop(ptr addrspace(1) nocapture %arg) #1 {
+ ; CHECK: @test_nested_loop{{.*}}) #1
+bb:
+ br label %bb2
+
+bb1:
+ ret void
+
+bb2:
+ %tmp1 = phi i32 [ 0, %bb ], [ %tmp2, %bb4 ]
+ %tmp2 = add nuw nsw i32 %tmp1, 1
+ %tmp3 = icmp eq i32 %tmp2, 8
+ br label %bb3
+
+bb3:
+ %tmp4 = phi i32 [ 0, %bb2 ], [ %tmp5, %bb3 ]
+ %tmp5 = add nuw nsw i32 %tmp4, 1
+ %tmp6 = icmp eq i32 %tmp5, 128
+ tail call void @llv...
[truncated]
This is an early draft, created as a discussion board to work out the proper approach and solutions.
You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h llvm/lib/Target/AMDGPU/AMDGPU.h llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
View the diff from clang-format here:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp
index c166def57..df799f1d4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.cpp
@@ -77,7 +77,8 @@ class AMDGPUConvertWaveSizeLegacy : public FunctionPass {
public:
static char ID;
- AMDGPUConvertWaveSizeLegacy(const GCNTargetMachine *TM) : FunctionPass(ID), TM(TM) {}
+ AMDGPUConvertWaveSizeLegacy(const GCNTargetMachine *TM)
+ : FunctionPass(ID), TM(TM) {}
bool runOnFunction(Function &F) override {
auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
@@ -181,84 +182,83 @@ bool AMDGPUConvertWaveSize::run(Function &F) {
// assuming readlane/readfirstlane or any cross-lane/DPP
// operations have "let isConvergent = 1" in IntrinsicsAMDGPU.td
if (Callee->isIntrinsic()) {
- if (Callee->hasFnAttribute(Attribute::Convergent)) {
- if (Callee->getIntrinsicID() != Intrinsic::amdgcn_s_barrier) {
- // TODO: what else should go in a "white list" ?
- // Intrinsic::amdgcn_s_barrier_wavefront ?
- // Intrinsic::amdgcn_s_barrier_signal ?
- LLVM_DEBUG(dbgs()
- << "AMDGPUConvertWaveSize: Convergent intrinsic "
- << Callee->getName() << " detected.\n");
- return false;
- }
- }
-
- if (Callee->getIntrinsicID() == Intrinsic::read_register ||
- Callee->getIntrinsicID() == Intrinsic::write_register) {
-
+ if (Callee->hasFnAttribute(Attribute::Convergent)) {
+ if (Callee->getIntrinsicID() != Intrinsic::amdgcn_s_barrier) {
+ // TODO: what else should go in a "white list" ?
+ // Intrinsic::amdgcn_s_barrier_wavefront ?
+ // Intrinsic::amdgcn_s_barrier_signal ?
LLVM_DEBUG(dbgs()
- << "AMDGPUConvertWaveSize: read/write_register "
- "intrinsic detected.\n");
+ << "AMDGPUConvertWaveSize: Convergent intrinsic "
+ << Callee->getName() << " detected.\n");
return false;
}
-
- // Take care of LDS access
- if (const auto *MTI = dyn_cast<MemTransferInst>(&I)) {
- auto DstAS = MTI->getDestAddressSpace();
- auto SrcAS = MTI->getSourceAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
- SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memcpy/memmove) detected.\n");
- return false;
- }
- } else if (const auto *MSI = dyn_cast<MemSetInst>(&I)) {
- auto DstAS = MSI->getDestAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memset) detected.\n");
- return false;
- }
- } else if (const auto AMCI = dyn_cast<AtomicMemCpyInst>(&I)) {
- auto DstAS = AMCI->getDestAddressSpace();
- auto SrcAS = AMCI->getSourceAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
- SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(
- dbgs()
- << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memcpy.element.unordered.atomic) detected\n");
- return false;
- }
- } else
- if (const auto AMMI = dyn_cast<AtomicMemMoveInst>(&I)) {
- auto DstAS = AMMI->getDestAddressSpace();
- auto SrcAS = AMMI->getSourceAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
- SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(
- dbgs()
- << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memmove.element.unordered.atomic) detected.\n");
- return false;
- }
- } else if (const auto *AMSI = dyn_cast<AtomicMemSetInst>(&I)) {
- auto DstAS = AMSI->getDestAddressSpace();
- if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
- LLVM_DEBUG(
- dbgs()
- << "AMDGPUConvertWaveSize: LDS access "
- "(llvm.memset.element.unordered.atomic) detected.\n");
- return false;
- }
+ }
+
+ if (Callee->getIntrinsicID() == Intrinsic::read_register ||
+ Callee->getIntrinsicID() == Intrinsic::write_register) {
+
+ LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: read/write_register "
+ "intrinsic detected.\n");
+ return false;
+ }
+
+ // Take care of LDS access
+ if (const auto *MTI = dyn_cast<MemTransferInst>(&I)) {
+ auto DstAS = MTI->getDestAddressSpace();
+ auto SrcAS = MTI->getSourceAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
+ SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memcpy/memmove) detected.\n");
+ return false;
+ }
+ } else if (const auto *MSI = dyn_cast<MemSetInst>(&I)) {
+ auto DstAS = MSI->getDestAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memset) detected.\n");
+ return false;
}
+ } else if (const auto AMCI = dyn_cast<AtomicMemCpyInst>(&I)) {
+ auto DstAS = AMCI->getDestAddressSpace();
+ auto SrcAS = AMCI->getSourceAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
+ SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(
+ dbgs()
+ << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memcpy.element.unordered.atomic) detected\n");
+ return false;
+ }
+ } else if (const auto AMMI = dyn_cast<AtomicMemMoveInst>(&I)) {
+ auto DstAS = AMMI->getDestAddressSpace();
+ auto SrcAS = AMMI->getSourceAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS ||
+ SrcAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(
+ dbgs()
+ << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memmove.element.unordered.atomic) detected.\n");
+ return false;
+ }
+ } else if (const auto *AMSI = dyn_cast<AtomicMemSetInst>(&I)) {
+ auto DstAS = AMSI->getDestAddressSpace();
+ if (DstAS == AMDGPUAS::LOCAL_ADDRESS) {
+ LLVM_DEBUG(
+ dbgs()
+ << "AMDGPUConvertWaveSize: LDS access "
+ "(llvm.memset.element.unordered.atomic) detected.\n");
+ return false;
+ }
+ }
// Save callee as a candidate for attribute change
Callees.push_back(Callee);
}
} else {
// General calls are not allowed.
- LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: function call detected.\n");
+ LLVM_DEBUG(dbgs()
+ << "AMDGPUConvertWaveSize: function call detected.\n");
return false;
}
}
@@ -309,9 +309,9 @@ bool AMDGPUConvertWaveSize::run(Function &F) {
if (const auto MemIntr = dyn_cast<MemIntrinsic>(&I))
- // TODO: Dynamic VGPRS and GFX11+ special operations ???
- BlockCost +=
- TTI->getInstructionCost(&I, TargetTransformInfo::TCK_Latency);
+ // TODO: Dynamic VGPRS and GFX11+ special operations ???
+ BlockCost +=
+ TTI->getInstructionCost(&I, TargetTransformInfo::TCK_Latency);
}
if (auto L = LI->getLoopFor(&BB)) {
const SCEV *TripCount = SE->getBackedgeTakenCount(L);
@@ -360,7 +360,8 @@ bool AMDGPUConvertWaveSize::changeWaveSizeAttr(Function *F) {
size_t Pos = AttrStr.find("+wavefrontsize32");
if (Pos != StringRef::npos) {
// Remove the "+wavefrontsize32" attribute.
- std::string NewBegin = AttrStr.substr(0, Pos).str().append("+wavefrontsize64");
+ std::string NewBegin =
+ AttrStr.substr(0, Pos).str().append("+wavefrontsize64");
std::string End = AttrStr.substr(Pos + strlen("+wavefrontsize32")).str();
std::string NewAttrStr = NewBegin + End;
// Add the "+wavefrontsize64" attribute.
@@ -376,23 +377,24 @@ bool AMDGPUConvertWaveSize::changeWaveSizeAttr(Function *F) {
return false;
}
-INITIALIZE_PASS_BEGIN(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE, "AMDGPU convert wave size",
- false, false)
+INITIALIZE_PASS_BEGIN(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE,
+ "AMDGPU convert wave size", false, false)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
-INITIALIZE_PASS_END(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE, "AMDGPU convert wave size",
- false, false)
+INITIALIZE_PASS_END(AMDGPUConvertWaveSizeLegacy, DEBUG_TYPE,
+ "AMDGPU convert wave size", false, false)
char AMDGPUConvertWaveSizeLegacy::ID = 0;
char &llvm::AMDGPUConvertWaveSizeLegacyID = AMDGPUConvertWaveSizeLegacy::ID;
-FunctionPass *llvm::createAMDGPUConvertWaveSizeLegacyPass(const GCNTargetMachine *TM) {
+FunctionPass *
+llvm::createAMDGPUConvertWaveSizeLegacyPass(const GCNTargetMachine *TM) {
return new AMDGPUConvertWaveSizeLegacy(TM);
}
-PreservedAnalyses AMDGPUConvertWaveSizePass::run(
- Function &F, FunctionAnalysisManager &FAM) {
+PreservedAnalyses AMDGPUConvertWaveSizePass::run(Function &F,
+ FunctionAnalysisManager &FAM) {
auto &LI = FAM.getResult<LoopAnalysis>(F);
auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
auto &TTI = FAM.getResult<TargetIRAnalysis>(F);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h
index e5b8c92c0..88152369e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUConvertWaveSize.h
@@ -1,4 +1,5 @@
-//===- SIConvertWaveSize.h ----------------------------------------*- C++- *-===//
+//===- SIConvertWaveSize.h ----------------------------------------*- C++-
+//*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
@@ -16,13 +17,13 @@
namespace llvm {
-class AMDGPUConvertWaveSizePass : public PassInfoMixin<AMDGPUConvertWaveSizePass> {
+class AMDGPUConvertWaveSizePass
+ : public PassInfoMixin<AMDGPUConvertWaveSizePass> {
/// The target machine.
const GCNTargetMachine *TM;
public:
- AMDGPUConvertWaveSizePass(const GCNTargetMachine &TM)
- : TM(&TM) {}
+ AMDGPUConvertWaveSizePass(const GCNTargetMachine &TM) : TM(&TM) {}
PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
};
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 6b214beb4..3dee8d09b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -17,6 +17,7 @@
#include "AMDGPUTargetMachine.h"
#include "AMDGPU.h"
#include "AMDGPUAliasAnalysis.h"
+#include "AMDGPUConvertWaveSize.h"
#include "AMDGPUCtorDtorLowering.h"
#include "AMDGPUExportClustering.h"
#include "AMDGPUExportKernelRuntimeHandles.h"
@@ -44,7 +45,6 @@
#include "R600TargetMachine.h"
#include "SIFixSGPRCopies.h"
#include "SIFixVGPRCopies.h"
-#include "AMDGPUConvertWaveSize.h"
#include "SIFoldOperands.h"
#include "SIFormMemoryClauses.h"
#include "SILoadStoreOptimizer.h"
A few questions to discuss right now:
You always know the wave size as long as you have the subtarget. I assume the attribute shall always be set by the FE.
I am not sure what you mean, but it sounds like you are referring to intrinsics accessing the wave size, since those need such mangling. That shall not be allowed.
// Check if the function is a device enqueue call.
if (F.hasFnAttribute("amdgpu-device-enqueue")) {
I did not find such an attribute in our sources.
Just forgot to comment it out. It does not exist yet. The placeholder was added to mark an idea to discuss: could we make the FE set such an attribute based on the presence of @llvm.amdgcn.dispatch.ptr, @llvm.amdgcn.implicitarg.ptr, and other relevant attributes?
I believe the kernel simply shall have no references to its symbol, i.e. its address shall not escape.
No, we can only do negative usage detection in AMDGPUAttributor. Attributes that specify that a feature is positively used are broken; we don't have any of those anymore.
// No LDS access is allowed
if (auto LI = dyn_cast<LoadInst>(&I)) {
A load or store to a local address is not a sufficient check; LDS may be addressed through a flat pointer. The kernel shall have no references to LDS objects.
I have a few questions...
- We can pass a flat pointer as an argument.
- We cannot check that it does not point to LDS.
- The "noalias" attribute does not help.
- AMDGPU::addrspacesMayAlias states that flat may alias anything.
- On the other hand, it would be an ABI violation if a kernel argument aliased an object in LDS.
- We can have a non-pointer (i32) argument and inttoptr to addrspace(3) - trivial to check (see the sketch after this list).
- addrspacecast to addrspace(3) - trivial to check.
- Any arbitrary integer expression cast to a flat pointer - we cannot check whether it points to LDS (probably a corner case).
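A minimal sketch of the two "trivial to check" cases from the list above. The helper name producesLDSPointer is hypothetical and this is illustrative only, assuming the usual AMDGPU headers are available:

static bool producesLDSPointer(const Instruction &I) {
  // addrspacecast to addrspace(3) is trivial to detect.
  if (const auto *ASC = dyn_cast<AddrSpaceCastInst>(&I))
    return ASC->getDestAddressSpace() == AMDGPUAS::LOCAL_ADDRESS;
  // So is an inttoptr that produces an addrspace(3) pointer.
  if (const auto *ITP = dyn_cast<IntToPtrInst>(&I))
    return ITP->getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS;
  return false;
}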
If you're restricting this to kernels, it's easier to check that there is no allocation on the kernel (e.g. check amdgpu-lds-size; one would need to double-check how this interacts with the dynamic LDS case).
You cannot pass a pointer to LDS into a kernel. There is no way to allocate it. But dynamic LDS shall be disallowed.
But a kernel can have an LDS allocation as a kernel argument, as that is how it works in OpenCL. It has no content.
Right, that's what I mean. The transformation shall be disabled if we have dynamic LDS.
The current change regarding the LDS access check is a draft aiming to figure out whether this looks like the right approach.
The current change regarding the LDS access check is a draft aiming to figure out whether this looks like the right approach.
You do not need to check all possible memory instructions; just check whether LDS globals are used anywhere, and also that dynamic LDS is not used.
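A minimal sketch of that suggestion, assuming it is enough to look for used addrspace(3) globals at module scope. The helper name usesLDS is hypothetical, and per-kernel reachability of those uses is deliberately glossed over, so this is conservative:

static bool usesLDS(const Module &M) {
  for (const GlobalVariable &GV : M.globals())
    // LDS objects, including dynamic LDS, live in the local address space.
    if (GV.getAddressSpace() == AMDGPUAS::LOCAL_ADDRESS && !GV.use_empty())
      return true;
  return false;
}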
bool SIConvertWaveSize::changeWaveSizeAttr(Function *F) {
  auto Attr = F->getFnAttribute("target-features");
In addition you need to change AMDGPUSubtarget::WavefrontSizeLog2, so that ST.getWavefrontSize() returns the correct value.
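One way to sanity-check that, assuming the target machine keys its cached subtargets on the function's feature string and so re-creates them after the attribute rewrite; a sketch, not part of the patch:

// After changeWaveSizeAttr(&F) succeeds, a re-queried subtarget should
// reflect the rewritten "target-features" string.
const GCNSubtarget &NewST = TM->getSubtarget<GCNSubtarget>(F);
assert(NewST.getWavefrontSize() == 64 && "subtarget did not pick up wave64");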
@@ -0,0 +1,121 @@
; RUN: opt -S -mtriple=amdgcn -mcpu=gfx1100 -passes=si-convert-wave-size < %s | FileCheck %s |
We also need to test full compilation and check the resulting .amdhsa_wavefront_size32 value.
This would be easier to implement in attributor
// 1. A kernel shall have no function calls as we cannot analyze call stack
// requirements (nor will it fall into the category of short-lived kernels
// anyway).
If you did this in the attributor and could convert the entire call graph, this wouldn't be an issue.
Not needed, though. The cost of one call would make the whole transformation useless.
The call itself isn't necessarily expensive, but the attributor would also take care of most of these queries even within a single kernel
We are talking about quite a low latency threshold.
void printFunctionAttributes(const Function &F) {
  LLVM_DEBUG(dbgs() << "Function: " << F.getName() << "\n");
  for (const auto &Attr : F.getAttributes()) {
    LLVM_DEBUG(dbgs() << " Attribute: " << Attr.getAsString() << "\n");
  }
}
Leftover debug
It was left in on purpose, as this is an early draft. I will remove it once the debugging is finished.
  return Changed;
}

bool SIConvertWaveSize::changeWaveSizeAttr(Function *F) {
You shouldn't be trying to parse or find the position; you can simply append the new wavesize feature at the end.
bb4:
  br i1 %tmp3, label %bb1, label %bb2
}
test atomicrmw + cmpxchg + memory intrinsics
F->removeFnAttr("target-features");
F->addFnAttr("target-features", NewAttrStr);
just doing addFnAttr should override the old value
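Combining the two suggestions (append instead of splicing, and rely on addFnAttr replacing the old string attribute), changeWaveSizeAttr could shrink to something like the sketch below. The helper name setWave64 is hypothetical, and it assumes later entries in "target-features" take precedence over earlier ones:

static void setWave64(Function &F) {
  StringRef Old = F.getFnAttribute("target-features").getValueAsString();
  // Later features override earlier ones, so appending is sufficient.
  std::string New = (Old.empty() ? std::string() : Old.str() + ",") +
                    "-wavefrontsize32,+wavefrontsize64";
  F.addFnAttr("target-features", New); // replaces any previous value
}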
This also probably should be an opt-in feature... I don't think we have any API rules that prevent you from assuming the wave size will match what you compiled for. At the very least we need an explicit opt-out mechanism.
@llvm.read_register.i32(some regular reg) is allowed, as it is not convergent and does not read/write exec or vcc. We would have to use llvm.read_register.i64 instead if we convert the caller kernel to wave64; otherwise it will only read the low half of the register.
You cannot do that and keep the semantics. Just bail on any use of these intrinsics, regardless of the register accessed.
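The updated version of the patch does exactly that (see the clang-format diff above): it bails out on any read_register/write_register without inspecting the register name:

if (Callee->getIntrinsicID() == Intrinsic::read_register ||
    Callee->getIntrinsicID() == Intrinsic::write_register) {
  LLVM_DEBUG(dbgs() << "AMDGPUConvertWaveSize: read/write_register "
                       "intrinsic detected.\n");
  return false;
}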
// Check if the function can be called via device enqueue.
bool addressEscapes = false;
if (!F.use_empty()) {
  const Module *M = F.getParent();
  for (const GlobalVariable &GV : M->globals()) {
    if (GV.hasInitializer()) {
      if (const Constant *Init = GV.getInitializer()) {
        if (isa<Function>(Init) && Init == &F) {
          addressEscapes = true;
        }
      }
    }
  }
}

if (addressEscapes) {
F.hasAddressTaken (but again, this whole thing would be easier to implement in attributor)
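That is, the whole quoted block above collapses to a single query; a sketch:

// Conservatively reject kernels whose address escapes (e.g. is stored into
// a global initializer); this also covers the device-enqueue concern.
if (F.hasAddressTaken()) {
  LLVM_DEBUG(dbgs() << "SIConvertWaveSize: kernel address escapes.\n");
  return false;
}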