[AMDGPU] Hoist permlane64/readlane/readfirstlane through unary/binary operands #129037
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

When a read(first)lane is used on a binary operator and the intrinsic is the only user of the operator, we can move the read(first)lane into the operand if the other operand is uniform. Unfortunately, InstCombine doesn't let us access UniformityAnalysis, so we can't truly check uniformity; we have to make do with a basic uniformity check that only allows constants or trivially uniform intrinsic calls. We can also do the same for simple unary operations.

Patch is 24.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129037.diff

4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index ebc00e59584ac..c2a977d7168fe 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -481,6 +481,59 @@ bool GCNTTIImpl::simplifyDemandedLaneMaskArg(InstCombiner &IC,
return false;
}
+Instruction *GCNTTIImpl::hoistReadLaneThroughOperand(InstCombiner &IC,
+ IntrinsicInst &II) const {
+ Instruction *Op = dyn_cast<Instruction>(II.getOperand(0));
+
+ // Only do this if both instructions are in the same block
+ // (so the exec mask won't change) and the readlane is the only user of its
+ // operand.
+ if (!Op || !Op->hasOneUser() || Op->getParent() != II.getParent())
+ return nullptr;
+
+ const bool IsReadLane = (II.getIntrinsicID() == Intrinsic::amdgcn_readlane);
+
+ // If this is a readlane, check that the second operand is a constant, or is
+ // defined before Op so we know it's safe to move this intrinsic higher.
+ Value *LaneID = nullptr;
+ if (IsReadLane) {
+ LaneID = II.getOperand(1);
+ if (!isa<Constant>(LaneID) && !(isa<Instruction>(LaneID) &&
+ cast<Instruction>(LaneID)->comesBefore(Op)))
+ return nullptr;
+ }
+
+ const auto DoIt = [&](unsigned OpIdx) -> Instruction * {
+ SmallVector<Value *, 2> Ops{Op->getOperand(OpIdx)};
+ if (IsReadLane)
+ Ops.push_back(LaneID);
+
+ Instruction *NewII =
+ IC.Builder.CreateIntrinsic(II.getType(), II.getIntrinsicID(), Ops);
+
+ Instruction &NewOp = *Op->clone();
+ NewOp.setOperand(OpIdx, NewII);
+ return &NewOp;
+ };
+
+ // TODO: Are any operations more expensive on the SALU than VALU, and thus
+ // need to be excluded here?
+
+ if (isa<UnaryOperator>(Op))
+ return DoIt(0);
+
+ if (isa<BinaryOperator>(Op)) {
+ // FIXME: If we had access to UniformityInfo here we could just check
+ // if the operand is uniform.
+ if (isTriviallyUniform(Op->getOperandUse(0)))
+ return DoIt(1);
+ if (isTriviallyUniform(Op->getOperandUse(1)))
+ return DoIt(0);
+ }
+
+ return nullptr;
+}
+
std::optional<Instruction *>
GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
Intrinsic::ID IID = II.getIntrinsicID();
@@ -1128,6 +1181,12 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
simplifyDemandedLaneMaskArg(IC, II, 1))
return &II;
+ // If the readfirstlane reads the result of an operation that exists
+ // both in the SALU and VALU, we may be able to hoist it higher in order
+ // to scalarize the expression.
+ if (Instruction *Res = hoistReadLaneThroughOperand(IC, II))
+ return Res;
+
return std::nullopt;
}
case Intrinsic::amdgcn_writelane: {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
index a0d62008d9ddc..4f1ae82739d16 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
@@ -224,6 +224,9 @@ class GCNTTIImpl final : public BasicTTIImplBase<GCNTTIImpl> {
bool simplifyDemandedLaneMaskArg(InstCombiner &IC, IntrinsicInst &II,
unsigned LaneAgIdx) const;
+ Instruction *hoistReadLaneThroughOperand(InstCombiner &IC,
+ IntrinsicInst &II) const;
+
std::optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
IntrinsicInst &II) const;
std::optional<Value *> simplifyDemandedVectorEltsIntrinsic(
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readfirstlane.ll b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readfirstlane.ll
new file mode 100644
index 0000000000000..9f27fda591382
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readfirstlane.ll
@@ -0,0 +1,461 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1030 -passes=instcombine -S < %s | FileCheck %s
+
+; test unary
+
+define float @hoist_fneg_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fneg_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fneg float [[TMP0]]
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fneg float %arg
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define double @hoist_fneg_f64(double %arg) {
+; CHECK-LABEL: define double @hoist_fneg_f64(
+; CHECK-SAME: double [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.amdgcn.readfirstlane.f64(double [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fneg double [[TMP0]]
+; CHECK-NEXT: ret double [[RFL]]
+;
+bb:
+ %val = fneg double %arg
+ %rfl = call double @llvm.amdgcn.readfirstlane.f64(double %val)
+ ret double %rfl
+}
+
+; test binary i32
+
+define i32 @hoist_add_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_add_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = add i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fadd_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fadd_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fadd float [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fadd float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_sub_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_sub_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP0]], -16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = sub i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fsub_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fsub_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fadd float [[TMP0]], -1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fsub float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_mul_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_mul_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = mul i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = mul i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fmul_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fmul_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fmul float [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fmul float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_udiv_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_udiv_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = udiv i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = udiv i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_sdiv_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_sdiv_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = sdiv i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = sdiv i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fdiv_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fdiv_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fmul float [[TMP0]], 7.812500e-03
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fdiv float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_urem_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_urem_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = urem i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = urem i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_srem_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_srem_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = srem i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = srem i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_frem_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_frem_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = frem float [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = frem float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_shl_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_shl_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = shl i32 [[TMP0]], 4
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = shl i32 %arg, 4
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_lshr_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_lshr_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = lshr i32 [[TMP0]], 4
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = lshr i32 %arg, 4
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_ashr_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_ashr_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = ashr i32 [[TMP0]], 4
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = ashr i32 %arg, 4
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+
+define i32 @hoist_and_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_and_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = and i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = and i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_or_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_or_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = or i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = or i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_xor_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_xor_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = xor i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = xor i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+; test binary i64
+
+define i64 @hoist_and_i64(i64 %arg) {
+; CHECK-LABEL: define i64 @hoist_and_i64(
+; CHECK-SAME: i64 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.amdgcn.readfirstlane.i64(i64 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = and i64 [[TMP0]], 16777215
+; CHECK-NEXT: ret i64 [[RFL]]
+;
+bb:
+ %val = and i64 %arg, 16777215
+ %rfl = call i64 @llvm.amdgcn.readfirstlane.i64(i64 %val)
+ ret i64 %rfl
+}
+
+define double @hoist_fadd_f64(double %arg) {
+; CHECK-LABEL: define double @hoist_fadd_f64(
+; CHECK-SAME: double [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.amdgcn.readfirstlane.f64(double [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fadd double [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret double [[RFL]]
+;
+bb:
+ %val = fadd double %arg, 128.0
+ %rfl = call double @llvm.amdgcn.readfirstlane.f64(double %val)
+ ret double %rfl
+}
+
+; test constant on LHS
+
+define i32 @hoist_sub_i32_lhs(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_sub_i32_lhs(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = sub i32 16777215, [[TMP0]]
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = sub i32 16777215, %arg
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fsub_f32_lhs(float %arg) {
+; CHECK-LABEL: define float @hoist_fsub_f32_lhs(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fsub float 1.280000e+02, [[TMP0]]
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fsub float 128.0, %arg
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+; test other operand is trivially uniform
+
+define i32 @hoist_add_i32_trivially_uniform_rhs(i32 %arg, i32 %v.other) {
+; CHECK-LABEL: define i32 @hoist_add_i32_trivially_uniform_rhs(
+; CHECK-SAME: i32 [[ARG:%.*]], i32 [[V_OTHER:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[OTHER:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[V_OTHER]])
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP0]], [[OTHER]]
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %other = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %v.other)
+ %val = add i32 %arg, %other
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_add_i32_trivially_uniform_lhs(i32 %arg, i32 %v.other) {
+; CHECK-LABEL: define i32 @hoist_add_i32_trivially_uniform_lhs(
+; CHECK-SAME: i32 [[ARG:%.*]], i32 [[V_OTHER:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[OTHER:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[V_OTHER]])
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = sub i32 [[OTHER]], [[TMP0]]
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %other = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %v.other)
+ %val = sub i32 %other, %arg
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+; test multiple iterations
+
+define i32 @hoist_multiple_times(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_multiple_times(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[TMP0]], 2
+; CHECK-NEXT: [[TMP2:%.*]] = sub i32 16777215, [[TMP1]]
+; CHECK-NEXT: [[TMP3:%.*]] = xor i32 [[TMP2]], 4242
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP3]], 6
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val.0 = shl i32 %arg, 2
+ %val.1 = sub i32 16777215, %val.0
+ %val.2 = xor i32 %val.1, 4242
+ %val.3 = add i32 %val.2, 6
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val.3)
+ ret i32 %rfl
+}
+
+; test cases where hoisting isn't possible
+
+define i32 @cross_block_hoisting(i1 %cond, i32 %arg) {
+; CHECK-LABEL: define i32 @cross_block_hoisting(
+; CHECK-SAME: i1 [[COND:%.*]], i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*]]:
+; CHECK-NEXT: [[VAL:%.*]] = add i32 [[ARG]], 16777215
+; CHECK-NEXT: br i1 [[COND]], label %[[THEN:.*]], label %[[END:.*]]
+; CHECK: [[THEN]]:
+; CHECK-NEXT: [[RFL:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[VAL]])
+; CHECK-NEXT: br label %[[END]]
+; CHECK: [[END]]:
+; CHECK-NEXT: [[RES:%.*]] = phi i32 [ [[RFL]], %[[THEN]] ], [ [[VAL]], %[[BB]] ]
+; CHECK-NEXT: ret i32 [[RES]]
+;
+bb:
+ %val = add i32 %arg, 16777215
+ br i1 %cond, label %then, label %end
+
+then:
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ br label %end
+
+end:
+ %res = phi i32 [%rfl, %then], [%val, %bb]
+ ret i32 %res
+}
+
+define i32 @operand_is_instr(i32 %arg, ptr %src) {
+; CHECK-LABEL: define i32 @operand_is_instr(
+; CHECK-SAME: i32 [[ARG:%.*]], ptr [[SRC:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[OTHER:%.*]] = load i32, ptr [[SRC]], align 4
+; CHECK-NEXT: [[VAL:%.*]] = add i32 [[ARG]], [[OTHER]]
+; CHECK-NEXT: [[RFL:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[VAL]])
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %other = load i32, ptr %src
+ %val = add i32 %arg, %other
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @operand_is_arg(i32 %arg, i32 %other) {
+; CHECK-LABEL: define i32 @operand_is_arg(
+; CHECK-SAME: i32 [[ARG:%.*]], i32 [[OTHER:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[VAL:%.*]] = add i32 [[ARG]], [[OTHER]]
+; CHECK-NEXT: [[RFL:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[VAL]])
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = add i32 %arg, %other
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readlane.ll b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readlane.ll
new file mode 100644
index 0000000000000..6ac65f5c70337
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readlane.ll
@@ -0,0 +1,143 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1030 -passes=instcombine -S < %s | FileCheck %s
+
+; The readfirstlane version of this test covers all the interesting cases of the
+; shared logic. This testcase focuses on readlane specific pitfalls.
+
+; test unary
+
+define float @hoist_fneg_f32(float %arg, i32 %lane) {
+; CHECK-LABEL: define float @hoist_fneg_f32(
+; CHECK-SAME: float [[ARG:%.*]], i32 [[LANE:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK...
[truncated]
// TODO: Are any operations more expensive on the SALU than VALU, and thus
// need to be excluded here?

if (isa<UnaryOperator>(Op))
If you also test for CastInst, will this subsume #128494?
Sure, but I'm not sure if the same transform can always be done for permlane64.
@arsenm - can I just add permlane64 without changing anything? Then I can expand the transform to allow all cast operations too.
It seems safe for permlane64 but why is it desirable? For readlane/readfirstlane it moves more stuff from VALU to SALU, but for permlane64 that reason does not apply.
I asked Matt and it's because we also do this transform on other unary intrinsics, so we have no reason to not do it here too.
I added permlane64 so this patch subsumes #128494 now
I still don't get it. Why is:
(unaryintrinsic (unaryop X)) --> (unaryop (unaryintrinsic X))
a good thing in general? Is there some general principle in InstCombine that we should do this?
Do you have an example of InstCombine doing "this"?
Actually it mostly does the opposite
I suggest dropping permlane64 from this patch, unless someone can explain why it makes sense.
@arsenm Should I keep permlane in the patch or not?
Drop it for now
SIFoldOperand version of llvm#129037. Handles a limited number of opcodes because going from VALU to SALU isn't trivial, and we don't have a helper for it. I looked at our test suite and added all opcodes that were eligible and appeared as v_read(first)lane operands.
; CHECK-NEXT: ret i64 [[TMP0]]
;
bb:
  %val = zext i32 %arg to i64
I wonder what is the benefit of doing so?
The combine in general, or only the zext case?
I think it'll be beneficial in general because a zext can be translated as an and of the source with a mask, e.g. a zext from i8 to i32 can be s_and_b32 s0, s0, 0xFF I think. So we'd go from a VALU and to a SALU and.
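To make that concrete, here is a hand-written before/after sketch of the i32-to-i64 case quoted above (an illustration, not taken from the patch's test files); note the intrinsic is remangled to the narrower type:

; Before: the zext of a divergent value executes per-lane on the VALU.
define i64 @zext_before(i32 %arg) {
  %val = zext i32 %arg to i64
  %rfl = call i64 @llvm.amdgcn.readfirstlane.i64(i64 %val)
  ret i64 %rfl
}

; After: the readfirstlane is hoisted and remangled from .i64 to .i32,
; so the zext consumes a scalar value and can be selected on the SALU.
define i64 @zext_after(i32 %arg) {
  %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %arg)
  %val = zext i32 %rfl to i64
  ret i64 %val
}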
Should mention permlane if the patch handles it.
I updated the title and description.
Ping
Ping
if (!IC.getDominatorTree().dominates(LaneID, Op))
  return nullptr;
Use isSafeToMoveBefore? I'm somewhat surprised you can do a dominates check on a constant
isSafeToMoveBefore doesn't work here; I only need to check that the LaneID is usable at the point where I want to recreate the instruction, i.e. before OpInst.
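As a minimal sketch of that constraint (hypothetical IR, not from the patch's tests): the lane id must already be defined at the point where the intrinsic is re-created, i.e. before the operand instruction:

; OK to hoist: %lid is defined before %val, so the re-created readlane
; can be inserted at %val's position and still see its lane-id operand.
define i32 @lane_id_available(i32 %arg, i32 %lid.raw) {
  %lid = and i32 %lid.raw, 31
  %val = add i32 %arg, 42
  %rl = call i32 @llvm.amdgcn.readlane.i32(i32 %val, i32 %lid)
  ret i32 %rl
}

; Not hoisted: %lid is defined after %val, so a readlane placed above
; %val would use %lid before its definition.
define i32 @lane_id_too_late(i32 %arg, i32 %lid.raw) {
  %val = add i32 %arg, 42
  %lid = and i32 %lid.raw, 31
  %rl = call i32 @llvm.amdgcn.readlane.i32(i32 %val, i32 %lid)
  ret i32 %rl
}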
const auto DoIt = [&](unsigned OpIdx,
                      Function *NewIntrinsic) -> Instruction * {
Move this to a properly named utility function; the same thing is already duplicated elsewhere.
Do you mean the CreateCall + operand bundle copy bit? Or the entire lambda? I only see one other place in the file that calls getOperandBundlesAsDefs.
Yes, and the clone is also suspicious. I'd expect a Create or a clone, not both.
Create is for the intrinsic call, because we may need to remangle it if we hoist through a cast. Clone is for the instruction that was previously the operand of the intrinsic. So if we have (readfirstlane (zext x)), readfirstlane is II (re-created) and zext is OpInst (cloned).
I removed permlane handling from this patch, except for the BitCastInst case which was already handled before.
When a read(first)lane is used on a binary operator and the intrinsic is the only user of the operator, we can move the read(first)lane into the operand if the other operand is uniform.
Unfortunately, InstCombine doesn't let us access UniformityAnalysis, so we can't truly check uniformity; we have to make do with a basic uniformity check which only allows constants or trivially uniform intrinsic calls.
We can also do the same for unary and cast operators.
This also handles permlane64 because, in general, hoisting intrinsics higher in a chain of operations is better and can enable folding into users later.
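As a closing illustration (a hand-written sketch mirroring the tests above, not part of the patch), the core binary-operator rewrite looks like this; a trivially uniform operand such as another readfirstlane result works the same way as the constant shown here:

; Before: the add executes per-lane on the VALU, then one lane is read back.
define i32 @hoist_before(i32 %arg) {
  %val = add i32 %arg, 42
  %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
  ret i32 %rfl
}

; After: the readfirstlane moves onto the divergent operand, so the add
; has only scalar inputs and can be selected as a SALU instruction.
define i32 @hoist_after(i32 %arg) {
  %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %arg)
  %val = add i32 %rfl, 42
  ret i32 %val
}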