[AMDGPU] Extend wave reduce intrinsics for i32 type #126469

Open · wants to merge 2 commits into main

Conversation

easyonaadit (Contributor)

Currently, wave reduction intrinsics are supported only for the `umin` and `umax` operations on the `i32` type.
This patch extends support to the following operations for the `i32` type:
`uadd`, `add`, `usub`, `sub`, `min`, `max`, `and`, `or`, `xor`.
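
For reference, a minimal IR usage sketch, assuming the new intrinsics follow the same naming and overload scheme as the existing `llvm.amdgcn.wave.reduce.umin`/`umax` (per the `IntrinsicsAMDGPU.td` change below). The second operand is the immediate strategy selector; per the TODO in `SIISelLowering.cpp`, the lowering currently uses the iterative approach for every strategy value:

; A minimal sketch, not taken from the patch's tests.
declare i32 @llvm.amdgcn.wave.reduce.add.i32(i32, i32 immarg)

define amdgpu_kernel void @wave_sum(ptr addrspace(1) %out, i32 %v) {
  ; Every active lane receives the wave-wide sum of %v (in an SGPR).
  %sum = call i32 @llvm.amdgcn.wave.reduce.add.i32(i32 %v, i32 0)
  store i32 %sum, ptr addrspace(1) %out
  ret void
}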
llvmbot (Member) commented Feb 10, 2025

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-ir

Author: Aaditya (easyonaadit)

Changes

Currently, wave reduction intrinsics are supported only for the umin and umax operations on the i32 type.
This patch extends support to the following operations for the i32 type:
uadd, add, usub, sub, min, max, and, or, xor.


Patch is 516.30 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126469.diff

24 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+8-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+10-1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+114-10)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+23-8)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll (+1237)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.and.ll (+986)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll (+986)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll (+986)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.or.ll (+986)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll (+1286)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.uadd.ll (+1240)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+12-12)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.usub.ll (+1286)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.xor.ll (+1290)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.add.mir (+90)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.and.mir (+89)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.max.mir (+89)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.min.mir (+89)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.or.mir (+89)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.sub.mir (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.umax.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.umin.mir (+2-2)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.reduce.xor.mir (+92)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index eb7bde69994913c..fac4228d3bc1fe7 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2327,8 +2327,14 @@ class AMDGPUWaveReduce<LLVMType data_ty = llvm_anyint_ty> : Intrinsic<
     ],
     [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree, ImmArg<ArgIndex<1>>]>;
 
-def int_amdgcn_wave_reduce_umin : AMDGPUWaveReduce;
-def int_amdgcn_wave_reduce_umax : AMDGPUWaveReduce;
+multiclass AMDGPUWaveReduceOps<list<string> Operations> {
+  foreach Op = Operations in { def Op : AMDGPUWaveReduce; }
+}
+
+defvar Operations = [
+  "umin", "min", "umax", "max", "uadd", "add", "usub", "sub", "and", "or", "xor"
+];
+defm int_amdgcn_wave_reduce_ : AMDGPUWaveReduceOps<Operations>;
 
 def int_amdgcn_readfirstlane :
   Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 2e5f42c3bdc405c..7d8fb718a88eda3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4981,8 +4981,17 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[2] = AMDGPU::getValueMapping(MaskBank, MaskSize);
       break;
     }
+    case Intrinsic::amdgcn_wave_reduce_add:
+    case Intrinsic::amdgcn_wave_reduce_uadd:
+    case Intrinsic::amdgcn_wave_reduce_sub:
+    case Intrinsic::amdgcn_wave_reduce_usub:
+    case Intrinsic::amdgcn_wave_reduce_min:
     case Intrinsic::amdgcn_wave_reduce_umin:
-    case Intrinsic::amdgcn_wave_reduce_umax: {
+    case Intrinsic::amdgcn_wave_reduce_max:
+    case Intrinsic::amdgcn_wave_reduce_umax:
+    case Intrinsic::amdgcn_wave_reduce_and:
+    case Intrinsic::amdgcn_wave_reduce_or:
+    case Intrinsic::amdgcn_wave_reduce_xor: {
       unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
       OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, DstSize);
       unsigned OpSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index b632c50dae0e359..2f239ddecc79371 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -4940,6 +4940,28 @@ static MachineBasicBlock *emitIndirectDst(MachineInstr &MI,
   return LoopBB;
 }
 
+static uint32_t getInitialValueForWaveReduction(unsigned Opc) {
+  switch (Opc) {
+  case AMDGPU::S_MIN_U32:
+    return std::numeric_limits<uint32_t>::max();
+  case AMDGPU::S_MIN_I32:
+    return std::numeric_limits<int32_t>::max();
+  case AMDGPU::S_MAX_U32:
+    return std::numeric_limits<uint32_t>::min();
+  case AMDGPU::S_MAX_I32:
+    return std::numeric_limits<int32_t>::min();
+  case AMDGPU::S_ADD_I32:
+  case AMDGPU::S_SUB_I32:
+  case AMDGPU::S_OR_B32:
+  case AMDGPU::S_XOR_B32:
+    return std::numeric_limits<uint32_t>::min();
+  case AMDGPU::S_AND_B32:
+    return std::numeric_limits<uint32_t>::max();
+  default:
+    llvm_unreachable("Unexpected opcode in getInitialValueForWaveReduction");
+  }
+}
+
 static MachineBasicBlock *lowerWaveReduce(MachineInstr &MI,
                                           MachineBasicBlock &BB,
                                           const GCNSubtarget &ST,
@@ -4955,13 +4977,78 @@ static MachineBasicBlock *lowerWaveReduce(MachineInstr &MI,
   Register DstReg = MI.getOperand(0).getReg();
   MachineBasicBlock *RetBB = nullptr;
   if (isSGPR) {
-    // These operations with a uniform value i.e. SGPR are idempotent.
-    // Reduced value will be same as given sgpr.
-    // clang-format off
-    BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MOV_B32), DstReg)
-        .addReg(SrcReg);
-    // clang-format on
-    RetBB = &BB;
+    switch (Opc) {
+    case AMDGPU::S_MIN_U32:
+    case AMDGPU::S_MIN_I32:
+    case AMDGPU::S_MAX_U32:
+    case AMDGPU::S_MAX_I32:
+    case AMDGPU::S_AND_B32:
+    case AMDGPU::S_OR_B32: {
+      // Idempotent operations.
+      BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MOV_B32), DstReg).addReg(SrcReg);
+      RetBB = &BB;
+      break;
+    }
+    case AMDGPU::S_XOR_B32:
+    case AMDGPU::S_ADD_I32:
+    case AMDGPU::S_SUB_I32: {
+      const TargetRegisterClass *WaveMaskRegClass = TRI->getWaveMaskRegClass();
+      const TargetRegisterClass *DstRegClass = MRI.getRegClass(DstReg);
+      Register ExecMask = MRI.createVirtualRegister(WaveMaskRegClass);
+      Register ActiveLanes = MRI.createVirtualRegister(DstRegClass);
+
+      bool IsWave32 = ST.isWave32();
+      unsigned MovOpc = IsWave32 ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
+      unsigned ExecReg = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
+      unsigned CountReg =
+          IsWave32 ? AMDGPU::S_BCNT1_I32_B32 : AMDGPU::S_BCNT1_I32_B64;
+
+      auto Exec =
+          BuildMI(BB, MI, DL, TII->get(MovOpc), ExecMask).addReg(ExecReg);
+
+      auto NewAccumulator = BuildMI(BB, MI, DL, TII->get(CountReg), ActiveLanes)
+                                .addReg(Exec->getOperand(0).getReg());
+
+      switch (Opc) {
+      case AMDGPU::S_XOR_B32: {
+        // Performing an XOR operation on a uniform value
+        // depends on the parity of the number of active lanes.
+        // For even parity, the result will be 0, for odd
+        // parity the result will be the same as the input value.
+        Register ParityRegister = MRI.createVirtualRegister(DstRegClass);
+
+        auto ParityReg =
+            BuildMI(BB, MI, DL, TII->get(AMDGPU::S_AND_B32), ParityRegister)
+                .addReg(NewAccumulator->getOperand(0).getReg())
+                .addImm(1);
+        BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MUL_I32), DstReg)
+            .addReg(SrcReg)
+            .addReg(ParityReg->getOperand(0).getReg());
+        break;
+      }
+      case AMDGPU::S_SUB_I32: {
+        Register NegatedVal = MRI.createVirtualRegister(DstRegClass);
+
+        // Take the negation of the source operand.
+        auto InvertedValReg =
+            BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MUL_I32), NegatedVal)
+                .addImm(-1)
+                .addReg(SrcReg);
+        BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MUL_I32), DstReg)
+            .addReg(InvertedValReg->getOperand(0).getReg())
+            .addReg(NewAccumulator->getOperand(0).getReg());
+        break;
+      }
+      case AMDGPU::S_ADD_I32: {
+        BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MUL_I32), DstReg)
+            .addReg(SrcReg)
+            .addReg(NewAccumulator->getOperand(0).getReg());
+        break;
+      }
+      }
+      RetBB = &BB;
+    }
+    }
   } else {
     // TODO: Implement DPP Strategy and switch based on immediate strategy
     // operand. For now, for all the cases (default, Iterative and DPP we use
@@ -4997,9 +5084,8 @@ static MachineBasicBlock *lowerWaveReduce(MachineInstr &MI,
     unsigned ExecReg = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
 
     // Create initail values of induction variable from Exec, Accumulator and
-    // insert branch instr to newly created ComputeBlockk
-    uint32_t InitalValue =
-        (Opc == AMDGPU::S_MIN_U32) ? std::numeric_limits<uint32_t>::max() : 0;
+    // insert branch instr to newly created ComputeBlock
+    uint32_t InitalValue = getInitialValueForWaveReduction(Opc);
     auto TmpSReg =
         BuildMI(BB, I, DL, TII->get(MovOpc), LoopIterator).addReg(ExecReg);
     BuildMI(BB, I, DL, TII->get(AMDGPU::S_MOV_B32), InitalValReg)
@@ -5071,8 +5157,26 @@ SITargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
   switch (MI.getOpcode()) {
   case AMDGPU::WAVE_REDUCE_UMIN_PSEUDO_U32:
     return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_MIN_U32);
+  case AMDGPU::WAVE_REDUCE_MIN_PSEUDO_I32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_MIN_I32);
   case AMDGPU::WAVE_REDUCE_UMAX_PSEUDO_U32:
     return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_MAX_U32);
+  case AMDGPU::WAVE_REDUCE_MAX_PSEUDO_I32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_MAX_I32);
+  case AMDGPU::WAVE_REDUCE_UADD_PSEUDO_U32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_ADD_I32);
+  case AMDGPU::WAVE_REDUCE_ADD_PSEUDO_I32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_ADD_I32);
+  case AMDGPU::WAVE_REDUCE_USUB_PSEUDO_U32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_SUB_I32);
+  case AMDGPU::WAVE_REDUCE_SUB_PSEUDO_I32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_SUB_I32);
+  case AMDGPU::WAVE_REDUCE_AND_PSEUDO_B32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_AND_B32);
+  case AMDGPU::WAVE_REDUCE_OR_PSEUDO_B32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_OR_B32);
+  case AMDGPU::WAVE_REDUCE_XOR_PSEUDO_B32:
+    return lowerWaveReduce(MI, *BB, *getSubtarget(), AMDGPU::S_XOR_B32);
   case AMDGPU::S_UADDO_PSEUDO:
   case AMDGPU::S_USUBO_PSEUDO: {
     const DebugLoc &DL = MI.getDebugLoc();
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 6e08aff24ec23af..8a3e93a4faa956f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -277,16 +277,31 @@ def : GCNPat <(vt (int_amdgcn_set_inactive vt:$src, vt:$inactive)),
 def : GCNPat<(i32 (int_amdgcn_set_inactive_chain_arg i32:$src, i32:$inactive)),
     (V_SET_INACTIVE_B32 0, VGPR_32:$src, 0, VGPR_32:$inactive, (IMPLICIT_DEF))>;
 
-let usesCustomInserter = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0, Uses = [EXEC] in {
-  def WAVE_REDUCE_UMIN_PSEUDO_U32 : VPseudoInstSI <(outs SGPR_32:$sdst),
-    (ins VSrc_b32: $src, VSrc_b32:$strategy),
-    [(set i32:$sdst, (int_amdgcn_wave_reduce_umin i32:$src, i32:$strategy))]> {
+// clang-format off
+defvar int_amdgcn_wave_reduce_ = "int_amdgcn_wave_reduce_";
+multiclass
+    AMDGPUWaveReducePseudoGenerator<string Op, string DataType, string Size> {
+  let usesCustomInserter = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0, Uses = [EXEC] in {
+    def !toupper(Op) #"_PSEUDO_" #DataType #Size
+        : VPseudoInstSI<(outs SGPR_32 : $sdst),
+                        (ins VSrc_b32 : $src, VSrc_b32 : $strategy),
+                        [(set i32 : $sdst, (!cast<AMDGPUWaveReduce>(int_amdgcn_wave_reduce_ #Op) i32 : $src, i32 : $strategy))]> {}
   }
+}
+// clang-format on
 
-  def WAVE_REDUCE_UMAX_PSEUDO_U32 : VPseudoInstSI <(outs SGPR_32:$sdst),
-    (ins VSrc_b32: $src, VSrc_b32:$strategy),
-    [(set i32:$sdst, (int_amdgcn_wave_reduce_umax i32:$src, i32:$strategy))]> {
-  }
+// Input list : [Operation_name,
+//              type - Signed(I)/Unsigned(U)/Float(F)/Bitwise(B),
+//              Size_in_bits]
+defvar Operations = [
+  ["umin", "U", "32"], ["min", "I", "32"], ["umax", "U", "32"],
+  ["max", "I", "32"], ["uadd", "U", "32"], ["add", "I", "32"],
+  ["usub", "U", "32"], ["sub", "I", "32"], ["and", "B", "32"],
+  ["or", "B", "32"], ["xor", "B", "32"]
+];
+
+foreach Op = Operations in {
+  defm WAVE_REDUCE_ : AMDGPUWaveReducePseudoGenerator<Op[0], Op[1], Op[2]>;
 }
 
 let usesCustomInserter = 1, Defs = [VCC] in {
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll
new file mode 100644
index 000000000000000..e93e8d4108ba0a9
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll
@@ -0,0 +1,1237 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=0 < %s | FileCheck  -check-prefixes=GFX8DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=1 < %s | FileCheck  -check-prefixes=GFX8GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck  -check-prefixes=GFX9DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 < %s | FileCheck  -check-prefixes=GFX9GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1064DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1064GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1032DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1032GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1164DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1164GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1132DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1132GISEL %s
+
+define amdgpu_kernel void @uniform_value(ptr addrspace(1) %out, i32 %in) {
+; GFX8DAGISEL-LABEL: uniform_value:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: uniform_value:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: uniform_value:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX9DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: uniform_value:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX1064DAGISEL-LABEL: uniform_value:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_clause 0x1
+; GFX1064DAGISEL-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1064DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_endpgm
+;
+; GFX1064GISEL-LABEL: uniform_value:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_clause 0x1
+; GFX1064GISEL-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064GISEL-NEXT:    s_endpgm
+;
+; GFX1032DAGISEL-LABEL: uniform_value:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_clause 0x1
+; GFX1032DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_mul_i32 s2, s2, s3
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1032DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX1032DAGISEL-NEXT:    s_endpgm
+;
+; GFX1032GISEL-LABEL: uniform_value:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_clause 0x1
+; GFX1032GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032GISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_mul_i32 s2, s2, s3
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: uniform_value:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_clause 0x1
+; GFX1164DAGISEL-NEXT:    s_load_b32 s6, s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: uniform_value:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_clause 0x1
+; GFX1164GISEL-NEXT:    s_load_b32 s6, s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_mul_i32 s2, s6, s2
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: uniform_value:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_clause 0x1
+; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_mul_i32 s2, s2, s3
+; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: uniform_value:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_clause 0x1
+; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_mul_i32 s2, s2, s3
+; GFX1132GISEL-...
[truncated]
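
For context on the new uniform-input (SGPR) fast path in `lowerWaveReduce` above: when every active lane holds the same value, `add`, `sub`, and `xor` collapse to scalar arithmetic on the active-lane count. A sketch of the equivalent math in IR terms, assuming a wave64 target (the `S_BCNT1_I32_B64` of the saved EXEC mask is modeled here with ballot + ctpop; this is not the MIR the patch emits):

; %val is wave-uniform.
define i32 @uniform_add_equiv(i32 %val) {
  %exec = call i64 @llvm.amdgcn.ballot.i64(i1 true)  ; active-lane mask
  %cnt64 = call i64 @llvm.ctpop.i64(i64 %exec)       ; what S_BCNT1_I32_B64 computes
  %cnt = trunc i64 %cnt64 to i32
  %sum = mul i32 %val, %cnt                          ; add: val * popcount(exec)
  ret i32 %sum
}
; sub: (val * -1) * popcount(exec)
; xor: val * (popcount(exec) & 1)  ; only the lane-count parity matters
declare i64 @llvm.amdgcn.ballot.i64(i1)
declare i64 @llvm.ctpop.i64(i64)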

arsenm (Contributor) left a comment

We need clang builtins and unit tests to ensure these really work

easyonaadit (Contributor, Author) commented Feb 10, 2025

We need clang builtins and unit tests to ensure these really work

The long-term plan is to replace the codegen currently done in AtomicOptimizer with these intrinsics.
Are the builtins required just for testing purposes, or do we need to expose them to the user?

arsenm (Contributor) commented Feb 10, 2025

Are the builtins required just for testing purposes, or do we need to expose them to the user?

That's the initial use. I don't have any plan to push these into a public header, but they should be there as clang builtins.

@easyonaadit easyonaadit changed the title [AMDGPU] Wave Reduce Intrinsics for i32 type [AMDGPU] Extend wave reduce intrinsics for i32 type Feb 10, 2025
easyonaadit (Contributor, Author)

That's the initial use. I don't have any plan to push these into a public header, but they should be there as clang builtins.

Okay, sure, I'll do that.

jhuber6 (Contributor) left a comment

Do these implement the prefix sum / scan? Given that the last value of the scan is the reduction, I'm assuming that's the case.

jhuber6 (Contributor) commented Feb 10, 2025

We need clang builtins and unit tests to ensure these really work

I could put some unit tests in libc if needed. I currently test the scan helpers in gpuintrin.h there.

pravinjagtap (Contributor)

Do these implement the prefix sum / scan?

No, the work to add scan builtins is under development. This class of intrinsics just performs the reduction (the final reduced value ends up in an SGPR). You will find the use case here: #121148
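
To illustrate the distinction with assumed lane values (4 active lanes):

  per-lane input:                       1  2  3  4
  llvm.amdgcn.wave.reduce.add result:  10 10 10 10   (wave-uniform total)
  inclusive prefix scan (for contrast): 1  3  6 10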

pravinjagtap (Contributor)

We need clang builtins and unit tests to ensure these really work

The code for the reduction works (at least for the max op). The only user we have for this intrinsic at the moment is dyn alloca. Will the hip-tests for dyn alloca suffice to ensure that it is correct?

easyonaadit (Contributor, Author)

@arsenm
I have stacked a PR on this to add the builtins: #127013

easyonaadit (Contributor, Author)

Pinging @arsenm @pravinjagtap @jhuber6

easyonaadit (Contributor, Author)

We need clang builtins and unit tests to ensure these really work

I could put some unit tests in libc if needed. I currently test the scan helpers in gpuintrin.h there.

Are we looking for tests to check the codegen only, or do we need execution tests?

jhuber6 (Contributor) commented Feb 21, 2025

We need clang builtins and unit tests to ensure these really work

I could put some unit tests in libc if needed. I currently test the scan helpers in gpuintrin.h there.

Are we looking for tests to check the codegen only, or do we need execution tests?

I have runtime tests in https://github.com/llvm/llvm-project/blob/main/libc/test/integration/src/__support/GPU/scan_reduce.cpp#L74, but I'm unsure if it's a fit for this since it'd need to work for NVIDIA as well. I suppose those will get tested automatically once I change the definitions in gpuintrin.h to use these builtins for AMD.

easyonaadit (Contributor, Author)

ping.

jhuber6 (Contributor) left a comment

This looks reasonable to me, but I guess I'll defer the ultimate decision to people more familiar with the backend.

A more general question: are there any major barriers left to using the DPP lowering with whole-wave mode by default? I understand there were some previous issues with it, but many of those might be resolved now.

Also, I'm greatly looking forward to having clang builtins for these as well as the scan operator. Thanks for working on these.

easyonaadit (Contributor, Author)

A more general question, are there any major barriers left to using the DPP lowering with whole-wave mode by default? I understand there were some previous issues with it, but many of those might be resolved now.

I plan to implement the DPP lowering in a follow-up patch.
As for the issues with the whole-wave mode, I'm not sure if all of them have been resolved. I'll do some digging into that.

Also, I'm greatly looking forward to having clang builtins for these as well as the scan operator. Thanks for working on these.

Sure!

krzysz00 (Contributor)

I'd like to flag that these (or separate) intrinsics should have a clustered reduce mode, so that you can, say, do "the first 16 lanes get the reduction of their values, the second 16 lanes get the reduction of their values, ...".

Otherwise, higher-level code like LLPC or (in-development) MLIR will need to implement that logic itself.

jhuber6 (Contributor) commented Mar 27, 2025

I'd like to flag that these (or separate) intrinsics should have a clustered reduce mode, so that you can, say, do "the first 16 lanes get the reduction of their values, the second 16 lanes get the reduction of their values, ...".

Otherwise, higher-level code like LLPC or (in-development) MLIR will need to implement that logic itself.

That's basically the width argument to the CUDA __shfl intrinsic, right? We could reasonably add that as an argument to this.

pravinjagtap (Contributor)

I'd like to flag that these (or separate) intrinsics should have a clustered reduce mode, so that you can, say, do "the first 16 lanes get the reduction of their values, the second 16 lanes get the reduction of their values, ...". Otherwise, higher-level code like LLPC or (in-development) MLIR will need to implement that logic itself.

That's basically the width argument to the CUDA __shfl intrinsic, right? We could reasonably add that as an argument to this.

Right, the unsigned __reduce_add_sync(unsigned mask, unsigned value) builtins are already implemented. Here mask is a divergent value that represents subgroups within a wave, and the reduction is performed within those subgroups.
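
For concreteness, a clustered reduction with a cluster width of 4 over 8 lanes would behave like this (hypothetical semantics with assumed lane values; no such mode exists in this patch):

  per-lane input:       1  2  3  4 |  5  6  7  8
  clustered add (w=4): 10 10 10 10 | 26 26 26 26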

Comment on lines +2334 to +2336
defvar Operations = [
"umin", "min", "umax", "max", "uadd", "add", "usub", "sub", "and", "or", "xor"
];

Might as well inline this into the multi class and avoid the class parameter


bool IsWave32 = ST.isWave32();
unsigned MovOpc = IsWave32 ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
unsigned ExecReg = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;

Suggested change
unsigned ExecReg = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
MCRegister ExecReg = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;

@@ -4997,9 +5084,8 @@ static MachineBasicBlock *lowerWaveReduce(MachineInstr &MI,
unsigned ExecReg = IsWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;

// Create initail values of induction variable from Exec, Accumulator and

Typo initail

Comment on lines +297 to +300
["umin", "U", "32"], ["min", "I", "32"], ["umax", "U", "32"],
["max", "I", "32"], ["uadd", "U", "32"], ["add", "I", "32"],
["usub", "U", "32"], ["sub", "I", "32"], ["and", "B", "32"],
["or", "B", "32"], ["xor", "B", "32"]

I don't understand why you're splitting the "U" and "32"; just treat it as one thing.

case AMDGPU::S_OR_B32: {
// Idempotent operations.
BuildMI(BB, MI, DL, TII->get(AMDGPU::S_MOV_B32), DstReg).addReg(SrcReg);
RetBB = &BB;

I don't really follow what RetBB is for; it's always identical to BB?
