[AMDGPU] Hoist permlane64/readlane/readfirstlane through unary/binary operands #129037
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

When a read(first)lane is used on a binary operator and the intrinsic is the only user of the operator, we can move the read(first)lane into the operand if the other operand is uniform. Unfortunately, InstCombine doesn't let us access UniformityAnalysis, so we can't truly check uniformity; we have to make do with a basic uniformity check that only allows constants or trivially uniform intrinsic calls. We can also do the same for simple unary operations.

Patch is 24.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129037.diff

4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index ebc00e59584ac..c2a977d7168fe 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -481,6 +481,59 @@ bool GCNTTIImpl::simplifyDemandedLaneMaskArg(InstCombiner &IC,
return false;
}
+Instruction *GCNTTIImpl::hoistReadLaneThroughOperand(InstCombiner &IC,
+ IntrinsicInst &II) const {
+ Instruction *Op = dyn_cast<Instruction>(II.getOperand(0));
+
+ // Only do this if both instructions are in the same block
+ // (so the exec mask won't change) and the readlane is the only user of its
+ // operand.
+ if (!Op || !Op->hasOneUser() || Op->getParent() != II.getParent())
+ return nullptr;
+
+ const bool IsReadLane = (II.getIntrinsicID() == Intrinsic::amdgcn_readlane);
+
+ // If this is a readlane, check that the second operand is a constant, or is
+ // defined before Op so we know it's safe to move this intrinsic higher.
+ Value *LaneID = nullptr;
+ if (IsReadLane) {
+ LaneID = II.getOperand(1);
+ if (!isa<Constant>(LaneID) && !(isa<Instruction>(LaneID) &&
+ cast<Instruction>(LaneID)->comesBefore(Op)))
+ return nullptr;
+ }
+
+ const auto DoIt = [&](unsigned OpIdx) -> Instruction * {
+ SmallVector<Value *, 2> Ops{Op->getOperand(OpIdx)};
+ if (IsReadLane)
+ Ops.push_back(LaneID);
+
+ Instruction *NewII =
+ IC.Builder.CreateIntrinsic(II.getType(), II.getIntrinsicID(), Ops);
+
+ Instruction &NewOp = *Op->clone();
+ NewOp.setOperand(OpIdx, NewII);
+ return &NewOp;
+ };
+
+ // TODO: Are any operations more expensive on the SALU than VALU, and thus
+ // need to be excluded here?
+
+ if (isa<UnaryOperator>(Op))
+ return DoIt(0);
+
+ if (isa<BinaryOperator>(Op)) {
+ // FIXME: If we had access to UniformityInfo here we could just check
+ // if the operand is uniform.
+ if (isTriviallyUniform(Op->getOperandUse(0)))
+ return DoIt(1);
+ if (isTriviallyUniform(Op->getOperandUse(1)))
+ return DoIt(0);
+ }
+
+ return nullptr;
+}
+
std::optional<Instruction *>
GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
Intrinsic::ID IID = II.getIntrinsicID();
@@ -1128,6 +1181,12 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
simplifyDemandedLaneMaskArg(IC, II, 1))
return &II;
+ // If the readfirstlane reads the result of an operation that exists
+ // both in the SALU and VALU, we may be able to hoist it higher in order
+ // to scalarize the expression.
+ if (Instruction *Res = hoistReadLaneThroughOperand(IC, II))
+ return Res;
+
return std::nullopt;
}
case Intrinsic::amdgcn_writelane: {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
index a0d62008d9ddc..4f1ae82739d16 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
@@ -224,6 +224,9 @@ class GCNTTIImpl final : public BasicTTIImplBase<GCNTTIImpl> {
bool simplifyDemandedLaneMaskArg(InstCombiner &IC, IntrinsicInst &II,
unsigned LaneAgIdx) const;
+ Instruction *hoistReadLaneThroughOperand(InstCombiner &IC,
+ IntrinsicInst &II) const;
+
std::optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
IntrinsicInst &II) const;
std::optional<Value *> simplifyDemandedVectorEltsIntrinsic(
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readfirstlane.ll b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readfirstlane.ll
new file mode 100644
index 0000000000000..9f27fda591382
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readfirstlane.ll
@@ -0,0 +1,461 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1030 -passes=instcombine -S < %s | FileCheck %s
+
+; test unary
+
+define float @hoist_fneg_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fneg_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fneg float [[TMP0]]
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fneg float %arg
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define double @hoist_fneg_f64(double %arg) {
+; CHECK-LABEL: define double @hoist_fneg_f64(
+; CHECK-SAME: double [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.amdgcn.readfirstlane.f64(double [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fneg double [[TMP0]]
+; CHECK-NEXT: ret double [[RFL]]
+;
+bb:
+ %val = fneg double %arg
+ %rfl = call double @llvm.amdgcn.readfirstlane.f64(double %val)
+ ret double %rfl
+}
+
+; test binary i32
+
+define i32 @hoist_add_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_add_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = add i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fadd_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fadd_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fadd float [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fadd float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_sub_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_sub_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP0]], -16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = sub i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fsub_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fsub_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fadd float [[TMP0]], -1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fsub float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_mul_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_mul_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = mul i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = mul i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fmul_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fmul_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fmul float [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fmul float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_udiv_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_udiv_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = udiv i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = udiv i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_sdiv_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_sdiv_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = sdiv i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = sdiv i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fdiv_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_fdiv_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fmul float [[TMP0]], 7.812500e-03
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fdiv float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_urem_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_urem_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = urem i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = urem i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_srem_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_srem_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = srem i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = srem i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_frem_f32(float %arg) {
+; CHECK-LABEL: define float @hoist_frem_f32(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = frem float [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = frem float %arg, 128.0
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+define i32 @hoist_shl_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_shl_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = shl i32 [[TMP0]], 4
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = shl i32 %arg, 4
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_lshr_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_lshr_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = lshr i32 [[TMP0]], 4
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = lshr i32 %arg, 4
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_ashr_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_ashr_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = ashr i32 [[TMP0]], 4
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = ashr i32 %arg, 4
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+
+define i32 @hoist_and_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_and_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = and i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = and i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_or_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_or_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = or i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = or i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_xor_i32(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_xor_i32(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = xor i32 [[TMP0]], 16777215
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = xor i32 %arg, 16777215
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+; test binary i64
+
+define i64 @hoist_and_i64(i64 %arg) {
+; CHECK-LABEL: define i64 @hoist_and_i64(
+; CHECK-SAME: i64 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.amdgcn.readfirstlane.i64(i64 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = and i64 [[TMP0]], 16777215
+; CHECK-NEXT: ret i64 [[RFL]]
+;
+bb:
+ %val = and i64 %arg, 16777215
+ %rfl = call i64 @llvm.amdgcn.readfirstlane.i64(i64 %val)
+ ret i64 %rfl
+}
+
+define double @hoist_fadd_f64(double %arg) {
+; CHECK-LABEL: define double @hoist_fadd_f64(
+; CHECK-SAME: double [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call double @llvm.amdgcn.readfirstlane.f64(double [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fadd double [[TMP0]], 1.280000e+02
+; CHECK-NEXT: ret double [[RFL]]
+;
+bb:
+ %val = fadd double %arg, 128.0
+ %rfl = call double @llvm.amdgcn.readfirstlane.f64(double %val)
+ ret double %rfl
+}
+
+; test constant on LHS
+
+define i32 @hoist_sub_i32_lhs(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_sub_i32_lhs(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = sub i32 16777215, [[TMP0]]
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = sub i32 16777215, %arg
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define float @hoist_fsub_f32_lhs(float %arg) {
+; CHECK-LABEL: define float @hoist_fsub_f32_lhs(
+; CHECK-SAME: float [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call float @llvm.amdgcn.readfirstlane.f32(float [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = fsub float 1.280000e+02, [[TMP0]]
+; CHECK-NEXT: ret float [[RFL]]
+;
+bb:
+ %val = fsub float 128.0, %arg
+ %rfl = call float @llvm.amdgcn.readfirstlane.f32(float %val)
+ ret float %rfl
+}
+
+; test other operand is trivially uniform
+
+define i32 @hoist_add_i32_trivially_uniform_rhs(i32 %arg, i32 %v.other) {
+; CHECK-LABEL: define i32 @hoist_add_i32_trivially_uniform_rhs(
+; CHECK-SAME: i32 [[ARG:%.*]], i32 [[V_OTHER:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[OTHER:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[V_OTHER]])
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP0]], [[OTHER]]
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %other = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %v.other)
+ %val = add i32 %arg, %other
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @hoist_add_i32_trivially_uniform_lhs(i32 %arg, i32 %v.other) {
+; CHECK-LABEL: define i32 @hoist_add_i32_trivially_uniform_lhs(
+; CHECK-SAME: i32 [[ARG:%.*]], i32 [[V_OTHER:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[OTHER:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[V_OTHER]])
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[RFL:%.*]] = sub i32 [[OTHER]], [[TMP0]]
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %other = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %v.other)
+ %val = sub i32 %other, %arg
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+; test multiple iterations
+
+define i32 @hoist_multiple_times(i32 %arg) {
+; CHECK-LABEL: define i32 @hoist_multiple_times(
+; CHECK-SAME: i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[ARG]])
+; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[TMP0]], 2
+; CHECK-NEXT: [[TMP2:%.*]] = sub i32 16777215, [[TMP1]]
+; CHECK-NEXT: [[TMP3:%.*]] = xor i32 [[TMP2]], 4242
+; CHECK-NEXT: [[RFL:%.*]] = add i32 [[TMP3]], 6
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val.0 = shl i32 %arg, 2
+ %val.1 = sub i32 16777215, %val.0
+ %val.2 = xor i32 %val.1, 4242
+ %val.3 = add i32 %val.2, 6
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val.3)
+ ret i32 %rfl
+}
+
+; test cases where hoisting isn't possible
+
+define i32 @cross_block_hoisting(i1 %cond, i32 %arg) {
+; CHECK-LABEL: define i32 @cross_block_hoisting(
+; CHECK-SAME: i1 [[COND:%.*]], i32 [[ARG:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*]]:
+; CHECK-NEXT: [[VAL:%.*]] = add i32 [[ARG]], 16777215
+; CHECK-NEXT: br i1 [[COND]], label %[[THEN:.*]], label %[[END:.*]]
+; CHECK: [[THEN]]:
+; CHECK-NEXT: [[RFL:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[VAL]])
+; CHECK-NEXT: br label %[[END]]
+; CHECK: [[END]]:
+; CHECK-NEXT: [[RES:%.*]] = phi i32 [ [[RFL]], %[[THEN]] ], [ [[VAL]], %[[BB]] ]
+; CHECK-NEXT: ret i32 [[RES]]
+;
+bb:
+ %val = add i32 %arg, 16777215
+ br i1 %cond, label %then, label %end
+
+then:
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ br label %end
+
+end:
+ %res = phi i32 [%rfl, %then], [%val, %bb]
+ ret i32 %res
+}
+
+define i32 @operand_is_instr(i32 %arg, ptr %src) {
+; CHECK-LABEL: define i32 @operand_is_instr(
+; CHECK-SAME: i32 [[ARG:%.*]], ptr [[SRC:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[OTHER:%.*]] = load i32, ptr [[SRC]], align 4
+; CHECK-NEXT: [[VAL:%.*]] = add i32 [[ARG]], [[OTHER]]
+; CHECK-NEXT: [[RFL:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[VAL]])
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %other = load i32, ptr %src
+ %val = add i32 %arg, %other
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
+
+define i32 @operand_is_arg(i32 %arg, i32 %other) {
+; CHECK-LABEL: define i32 @operand_is_arg(
+; CHECK-SAME: i32 [[ARG:%.*]], i32 [[OTHER:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[VAL:%.*]] = add i32 [[ARG]], [[OTHER]]
+; CHECK-NEXT: [[RFL:%.*]] = call i32 @llvm.amdgcn.readfirstlane.i32(i32 [[VAL]])
+; CHECK-NEXT: ret i32 [[RFL]]
+;
+bb:
+ %val = add i32 %arg, %other
+ %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
+ ret i32 %rfl
+}
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readlane.ll b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readlane.ll
new file mode 100644
index 0000000000000..6ac65f5c70337
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.readlane.ll
@@ -0,0 +1,143 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1030 -passes=instcombine -S < %s | FileCheck %s
+
+; The readfirstlane version of this test covers all the interesting cases of the
+; shared logic. This testcase focuses on readlane specific pitfalls.
+
+; test unary
+
+define float @hoist_fneg_f32(float %arg, i32 %lane) {
+; CHECK-LABEL: define float @hoist_fneg_f32(
+; CHECK-SAME: float [[ARG:%.*]], i32 [[LANE:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK...
[truncated]
// TODO: Are any operations more expensive on the SALU than VALU, and thus
// need to be excluded here?

if (isa<UnaryOperator>(Op))
If you also test for CastInst, will this subsume #128494?
Sure, but I'm not sure if the same transform can always be done for permlane64.
@arsenm - can I just add permlane64 without changing anything? Then I can expand the transform to allow all cast operations too.
It seems safe for permlane64 but why is it desirable? For readlane/readfirstlane it moves more stuff from VALU to SALU, but for permlane64 that reason does not apply.
I asked Matt and it's because we also do this transform on other unary intrinsics, so we have no reason to not do it here too.
I added permlane64 so this patch subsumes #128494 now
I still don't get it. Why is:
(unaryintrinsic (unaryop X)) --> (unaryop (unaryintrinsic X))
a good thing in general? Is there some general principle in InstCombine that we should do this?
Do you have an example of InstCombine doing "this"?
Actually it mostly does the opposite
I suggest dropping permlane64 from this patch, unless someone can explain why it makes sense.
@arsenm Should I keep permlane in the patch or not?
Drop it for now
SIFoldOperand version of llvm#129037. Handles a limited number of opcodes because going from VALU to SALU isn't trivial, and we don't have a helper for it. I looked at our test suite and added all opcodes that were eligible and appeared as v_read(first)lane operands.
; CHECK-NEXT: ret i64 [[TMP0]]
;
bb:
  %val = zext i32 %arg to i64
I wonder what is the benefit of doing so?
The combine in general, or only the zext case?
I think it'll be beneficial in general because a zext can be translated as an and of the source with a mask, e.g. a zext from i8 to i32 can be s_and_b32 s0, s0, 0xFF I think. So we'd go from a VALU and to a SALU and.
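To make that concrete, here is a hand-written before/after sketch of the i32-to-i64 case quoted above (an illustration, not taken from the patch's test files); note the intrinsic is remangled to the narrower type:

; Before: the zext of a divergent value executes per-lane on the VALU.
define i64 @zext_before(i32 %arg) {
  %val = zext i32 %arg to i64
  %rfl = call i64 @llvm.amdgcn.readfirstlane.i64(i64 %val)
  ret i64 %rfl
}

; After: the readfirstlane is hoisted and remangled from .i64 to .i32,
; so the zext consumes a scalar value and can be selected on the SALU.
define i64 @zext_after(i32 %arg) {
  %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %arg)
  %val = zext i32 %rfl to i64
  ret i64 %val
}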
Should mention permlane if the patch handles it.
I updated the title and description.
Ping
Ping
if (!IC.getDominatorTree().dominates(LaneID, Op))
  return nullptr;
Use isSafeToMoveBefore? I'm somewhat surprised you can do a dominates check on a constant
isSafeToMoveBefore doesn't work here; I only need to check that the LaneID is usable at the point where I want to recreate the instruction, i.e. before OpInst.
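As a minimal sketch of that constraint (hypothetical IR, not from the patch's tests): the lane id must already be defined at the point where the intrinsic is re-created, i.e. before the operand instruction:

; OK to hoist: %lid is defined before %val, so the re-created readlane
; can be inserted at %val's position and still see its lane-id operand.
define i32 @lane_id_available(i32 %arg, i32 %lid.raw) {
  %lid = and i32 %lid.raw, 31
  %val = add i32 %arg, 42
  %rl = call i32 @llvm.amdgcn.readlane.i32(i32 %val, i32 %lid)
  ret i32 %rl
}

; Not hoisted: %lid is defined after %val, so a readlane placed above
; %val would use %lid before its definition.
define i32 @lane_id_too_late(i32 %arg, i32 %lid.raw) {
  %val = add i32 %arg, 42
  %lid = and i32 %lid.raw, 31
  %rl = call i32 @llvm.amdgcn.readlane.i32(i32 %val, i32 %lid)
  ret i32 %rl
}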
const auto DoIt = [&](unsigned OpIdx,
                      Function *NewIntrinsic) -> Instruction * {
Move this to a properly named utility function; the same thing is already duplicated elsewhere.
Do you mean the CreateCall + operand bundle copy bit? Or the entire lambda? I only see one other place in the file that calls getOperandBundlesAsDefs.
Yes, and the clone is also suspicious. I'd expect a Create or a clone, not both.
Create is for the intrinsic call, because we may need to remangle it if we hoist through a cast. Clone is for the instruction that was previously the operand of the intrinsic. So if we have (readfirstlane (zext x)), readfirstlane is II (re-created) and zext is OpInst (cloned).
I removed permlane handling from this patch, except for the BitCastInst case which was already handled before.
When a read(first)lane is used on a binary operator and the intrinsic is the only user of the operator, we can move the read(first)lane into the operand if the other operand is uniform.
Unfortunately, InstCombine doesn't let us access UniformityAnalysis, so we can't truly check uniformity; we have to make do with a basic uniformity check which only allows constants or trivially uniform intrinsic calls.
We can also do the same for unary and cast operators.
This also handles permlane64 because, in general, hoisting intrinsics higher in a chain of operations is better and can enable folding into users later.
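As a closing illustration (a hand-written sketch mirroring the tests above, not part of the patch), the core binary-operator rewrite looks like this; a trivially uniform operand such as another readfirstlane result works the same way as the constant shown here:

; Before: the add executes per-lane on the VALU, then one lane is read back.
define i32 @hoist_before(i32 %arg) {
  %val = add i32 %arg, 42
  %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %val)
  ret i32 %rfl
}

; After: the readfirstlane moves onto the divergent operand, so the add
; has only scalar inputs and can be selected as a SALU instruction.
define i32 @hoist_after(i32 %arg) {
  %rfl = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %arg)
  %val = add i32 %rfl, 42
  ret i32 %val
}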