[AMDGPU] SIPeepholeSDWA: Handle V_CNDMASK_B32_e64 #137930


Merged: 26 commits from sdwa-peephole-vcndmask into llvm:main, May 5, 2025

Conversation

@frederik-h (Contributor) commented Apr 30, 2025

The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in
operand. The conversion to SDWA implies a conversion to VOP2 form
which reads from VCC instead.

Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA
to V_CNDMASK_B32_e32 first and introduce a copy of the carry-in operand to VCC.

Closes #133431.
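As a minimal sketch of the intended rewrite (the registers here are hypothetical; the real patterns appear in the test diffs below): when the carry-in is produced by a compare that can be retargeted, the compare is changed to write VCC and the select drops to the VOP2/SDWA encoding.

; before: VOP3 select with an explicit carry-in SGPR pair
v_cmp_eq_u32_e64 s[4:5], 1, v1
v_cndmask_b32_e64 v0, v3, v2, s[4:5]
; after: the compare writes VCC, which the VOP2/SDWA form reads implicitly
v_cmp_eq_u32_e32 vcc, 1, v1
v_cndmask_b32_sdwa v0, v3, v2, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1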

…K_B32_e32

The problem hinted at by the comment for V_CNDMASK_B32_e64 (i.e. that conversion to the VOP2 SDWA form introduces an implicit VCC use) does not exist for V_CNDMASK_B32_e32. Hence the latter should already be acceptable for conversion to SDWA without further ado.

The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in
operand. The conversion to SDWA implies a conversion to VOP2 form
which reads from VCC instead.

Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA
to V_CNDMASK_B32_e32 first and either change the instruction that
defines the carry-in operand to write to VCC if this is possible or
introduce a write of the carry-in operand to VCC.
@llvmbot (Member) commented Apr 30, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Frederik Harwath (frederik-h)

Changes

The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in
operand. The conversion to SDWA implies a conversion to VOP2 form
which reads from VCC instead.

Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA
to V_CNDMASK_B32_e32 first and either change the instruction that
defines the carry-in operand to write to VCC if this is possible or
introduce a write of the carry-in operand to VCC.

Closes #133431.
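Where the carry-in definition cannot be retargeted, the pass instead materializes the carry into VCC with a compare against 1 (VCC[laneId] <- (CarryIn[laneId] == 1), per the comment in convertToImplicitVcc). For example, in the bf16.ll hunk below the rewritten sequence is:

v_cmp_eq_u32_e32 vcc, 1, v0       ; original condition
v_cndmask_b32_e32 v0, v2, v1, vcc ; first select consumes VCC
v_cmp_eq_u64_e64 vcc, 1, vcc      ; re-materialize the carry into VCC
v_cndmask_b32_sdwa v1, v2, v1, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1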


Patch is 470.84 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137930.diff

26 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp (+90-7)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+441-611)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+6-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-fmul-sel.ll (+14-26)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-f16.ll (+44-44)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-i16.ll (+455-115)
  • (modified) llvm/test/CodeGen/AMDGPU/fmax_legacy.f16.ll (+23-47)
  • (modified) llvm/test/CodeGen/AMDGPU/fmaximum3.ll (+99-156)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.f16.ll (+23-47)
  • (modified) llvm/test/CodeGen/AMDGPU/fminimum3.ll (+99-156)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+11-19)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+25-25)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2bf16.ll (+319-331)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll (+209-221)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll (+150-232)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll (+150-232)
  • (modified) llvm/test/CodeGen/AMDGPU/maximumnum.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/minimumnum.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/saddsat.ll (+13-13)
  • (added) llvm/test/CodeGen/AMDGPU/sdwa-peephole-vcnd_mask-1.mir (+48)
  • (added) llvm/test/CodeGen/AMDGPU/sdwa-peephole-vcnd_mask-2.mir (+43)
  • (added) llvm/test/CodeGen/AMDGPU/sdwa-peephole-vcndmask.mir (+41)
  • (modified) llvm/test/CodeGen/AMDGPU/select-fabs-fneg-extract.v2f16.ll (+108-224)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+248-394)
  • (modified) llvm/test/CodeGen/AMDGPU/ssubsat.ll (+13-13)
diff --git a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
index 22f23e4c94e2d..f5f808623cc0c 100644
--- a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
@@ -62,6 +62,7 @@ class SIPeepholeSDWA {
   std::unique_ptr<SDWAOperand> matchSDWAOperand(MachineInstr &MI);
   void pseudoOpConvertToVOP2(MachineInstr &MI,
                              const GCNSubtarget &ST) const;
+  void convertToImplicitVcc(MachineInstr &MI, const GCNSubtarget &ST) const;
   MachineInstr *createSDWAVersion(MachineInstr &MI);
   bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);
   void legalizeScalarOperands(MachineInstr &MI, const GCNSubtarget &ST) const;
@@ -1061,6 +1062,79 @@ void SIPeepholeSDWA::pseudoOpConvertToVOP2(MachineInstr &MI,
   MISucc.substituteRegister(CarryIn->getReg(), TRI->getVCC(), 0, *TRI);
 }
 
+static unsigned getVCmpEqOpcode(unsigned Bits) {
+  if (Bits == 64)
+    return AMDGPU::V_CMP_EQ_U64_e64;
+  if (Bits == 32)
+    return AMDGPU::V_CMP_EQ_U32_e64;
+  if (Bits == 16)
+    return AMDGPU::V_CMP_EQ_U16_e64;
+
+  llvm_unreachable("Unexpected register bit width.");
+};
+
+/// Try to convert an \p MI in VOP3 which takes an src2 carry-in
+/// operand into the corresponding VOP2 form which expects the
+/// argument in VCC. To this end, either try to change the definition
+/// of the carry-in operand to write to VCC or add an instruction that
+/// copies from the carry-in to VCC.  The conversion will only be
+/// applied if \p MI can be shrunk to VOP2 and if VCC can be proven to
+/// be dead before \p MI.
+void SIPeepholeSDWA::convertToImplicitVcc(MachineInstr &MI,
+                                          const GCNSubtarget &ST) const {
+  assert(MI.getOpcode() == AMDGPU::V_CNDMASK_B32_e64);
+
+  MCRegister Vcc = TRI->getVCC();
+  // FIXME Conversion introduces implicit vcc_hi use
+  if (Vcc == AMDGPU::VCC_LO)
+    return;
+
+  LLVM_DEBUG(dbgs() << "Attempting VOP2 conversion: " << MI);
+  if (!TII->canShrink(MI, *MRI)) {
+    LLVM_DEBUG(dbgs() << "Cannot shrink instruction\n");
+    return;
+  }
+
+  const MachineOperand &CarryIn =
+      *TII->getNamedOperand(MI, AMDGPU::OpName::src2);
+
+  // Make sure VCC or its subregs are dead before MI.
+  MachineBasicBlock &MBB = *MI.getParent();
+  auto Liveness = MBB.computeRegisterLiveness(TRI, Vcc, MI, 100);
+  if (Liveness != MachineBasicBlock::LQR_Dead) {
+    LLVM_DEBUG(dbgs() << "VCC not known to be dead before instruction.\n");
+    return;
+  }
+  // Change destination of compare instruction to VCC
+  // or copy to VCC if carry-in is not a compare inst.
+  Register CarryReg = CarryIn.getReg();
+  MachineInstr &CarryDef = *MRI->getVRegDef(CarryReg);
+
+  if (CarryDef.isCompare() && TII->isVOP3(CarryDef) &&
+      MRI->hasOneUse(CarryIn.getReg())) {
+    CarryDef.substituteRegister(CarryIn.getReg(), Vcc, 0, *TRI);
+    CarryDef.moveBefore(&MI);
+  } else {
+    // Add write: VCC[laneId] <- (CarryIn[laneId] == 1)
+    const TargetRegisterClass *Class =
+        TRI->getRegClassForOperandReg(*MRI, CarryIn);
+    unsigned RegSize = Class->MC->getSizeInBits();
+    BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(getVCmpEqOpcode(RegSize)))
+        .addReg(Vcc, RegState::Define)
+        .addImm(1)
+        .add(CarryIn);
+  }
+
+  auto Converted = BuildMI(MBB, MI, MI.getDebugLoc(),
+                           TII->get(AMDGPU::getVOPe32(MI.getOpcode())))
+                       .add(*TII->getNamedOperand(MI, AMDGPU::OpName::vdst))
+                       .add(*TII->getNamedOperand(MI, AMDGPU::OpName::src0))
+                       .add(*TII->getNamedOperand(MI, AMDGPU::OpName::src1))
+                       .setMIFlags(MI.getFlags());
+  LLVM_DEBUG(dbgs() << "Converted to VOP2: " << *Converted << '\n');
+  MI.eraseFromParent();
+}
+
 namespace {
 bool isConvertibleToSDWA(MachineInstr &MI,
                          const GCNSubtarget &ST,
@@ -1070,6 +1144,11 @@ bool isConvertibleToSDWA(MachineInstr &MI,
   if (TII->isSDWA(Opc))
     return true;
 
+  // Can only be handled after earlier conversion to
+  // AMDGPU::V_CNDMASK_B32_e32 which is not always possible.
+  if (Opc == AMDGPU::V_CNDMASK_B32_e64)
+    return false;
+
   // Check if this instruction has opcode that supports SDWA
   if (AMDGPU::getSDWAOp(Opc) == -1)
     Opc = AMDGPU::getVOPe32(Opc);
@@ -1108,10 +1187,6 @@ bool isConvertibleToSDWA(MachineInstr &MI,
   if (TII->pseudoToMCOpcode(Opc) == -1)
     return false;
 
-  // FIXME: has SDWA but require handling of implicit VCC use
-  if (Opc == AMDGPU::V_CNDMASK_B32_e32)
-    return false;
-
   if (MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0)) {
     if (!Src0->isReg() && !Src0->isImm())
       return false;
@@ -1384,10 +1459,18 @@ bool SIPeepholeSDWA::run(MachineFunction &MF) {
       for (const auto &OperandPair : SDWAOperands) {
         const auto &Operand = OperandPair.second;
         MachineInstr *PotentialMI = Operand->potentialToConvert(TII, ST);
-        if (PotentialMI &&
-           (PotentialMI->getOpcode() == AMDGPU::V_ADD_CO_U32_e64 ||
-            PotentialMI->getOpcode() == AMDGPU::V_SUB_CO_U32_e64))
+        if (!PotentialMI)
+          continue;
+
+        switch (PotentialMI->getOpcode()) {
+        case AMDGPU::V_ADD_CO_U32_e64:
+        case AMDGPU::V_SUB_CO_U32_e64:
           pseudoOpConvertToVOP2(*PotentialMI, ST);
+          break;
+        case AMDGPU::V_CNDMASK_B32_e64:
+          convertToImplicitVcc(*PotentialMI, ST);
+          break;
+        };
       }
       SDWAOperands.clear();
 
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index 19b6ff68b9869..e172bf090cca7 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -38481,10 +38481,8 @@ define <2 x bfloat> @v_select_v2bf16(i1 %cond, <2 x bfloat> %a, <2 x bfloat> %b)
 ; GFX8-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX8-NEXT:    v_cndmask_b32_e32 v0, v2, v1, vcc
-; GFX8-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GFX8-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX8-NEXT:    v_cmp_eq_u64_e64 vcc, 1, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v2, v1, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -38494,10 +38492,9 @@ define <2 x bfloat> @v_select_v2bf16(i1 %cond, <2 x bfloat> %a, <2 x bfloat> %b)
 ; GFX9-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v1, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
-; GFX9-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT:    v_cmp_eq_u64_e64 vcc, 1, vcc
 ; GFX9-NEXT:    s_mov_b32 s4, 0x5040100
+; GFX9-NEXT:    v_cndmask_b32_sdwa v1, v2, v1, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s4
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -38581,11 +38578,8 @@ define <2 x bfloat> @v_vselect_v2bf16(<2 x i1> %cond, <2 x bfloat> %a, <2 x bflo
 ; GFX8-NEXT:    v_and_b32_e32 v1, 1, v1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX8-NEXT:    v_cndmask_b32_e32 v0, v3, v2, vcc
-; GFX8-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GFX8-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v3, v2, vcc
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v3, v2, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -38596,10 +38590,8 @@ define <2 x bfloat> @v_vselect_v2bf16(<2 x i1> %cond, <2 x bfloat> %a, <2 x bflo
 ; GFX9-NEXT:    v_and_b32_e32 v1, 1, v1
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v2, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
-; GFX9-NEXT:    v_lshrrev_b32_e32 v3, 16, v3
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v3, v2, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v1, v3, v2, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    s_mov_b32 s4, 0x5040100
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s4
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
@@ -38767,17 +38759,17 @@ define amdgpu_ps i32 @s_select_v2bf16(<2 x bfloat> inreg %a, <2 x bfloat> inreg
 ;
 ; GFX8-LABEL: s_select_v2bf16:
 ; GFX8:       ; %bb.0:
+; GFX8-NEXT:    v_mov_b32_e32 v2, s1
+; GFX8-NEXT:    v_mov_b32_e32 v3, s0
+; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
 ; GFX8-NEXT:    s_lshr_b32 s2, s0, 16
 ; GFX8-NEXT:    s_lshr_b32 s3, s1, 16
+; GFX8-NEXT:    v_cndmask_b32_e32 v0, v2, v3, vcc
+; GFX8-NEXT:    v_cmp_eq_u64_e64 vcc, 1, vcc
 ; GFX8-NEXT:    v_mov_b32_e32 v1, s3
 ; GFX8-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX8-NEXT:    v_cndmask_b32_e32 v0, v1, v2, vcc
-; GFX8-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8-NEXT:    v_mov_b32_e32 v2, s0
-; GFX8-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v1, v2, vcc
-; GFX8-NEXT:    v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v1, v2, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
 ; GFX8-NEXT:    ; return to shader part epilog
 ;
@@ -38885,11 +38877,10 @@ define amdgpu_ps i32 @s_vselect_v2bf16(<2 x bfloat> inreg %a, <2 x bfloat> inreg
 ; GFX8-NEXT:    v_mov_b32_e32 v2, s3
 ; GFX8-NEXT:    v_mov_b32_e32 v3, s2
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v2, v3, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v2, v3, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX8-NEXT:    v_mov_b32_e32 v2, s1
 ; GFX8-NEXT:    v_mov_b32_e32 v3, s0
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX8-NEXT:    v_cndmask_b32_e32 v0, v2, v3, vcc
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
@@ -40567,11 +40558,10 @@ define amdgpu_ps <2 x i32> @s_vselect_v4bf16(<4 x bfloat> inreg %a, <4 x bfloat>
 ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
 ; GFX8-NEXT:    v_mov_b32_e32 v5, s4
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v3
-; GFX8-NEXT:    v_cndmask_b32_e32 v3, v4, v5, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v3, v4, v5, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX8-NEXT:    v_mov_b32_e32 v4, s3
 ; GFX8-NEXT:    v_mov_b32_e32 v5, s1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v2
-; GFX8-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
 ; GFX8-NEXT:    v_cndmask_b32_e32 v2, v4, v5, vcc
 ; GFX8-NEXT:    s_lshr_b32 s1, s0, 16
 ; GFX8-NEXT:    s_lshr_b32 s3, s2, 16
@@ -40579,11 +40569,10 @@ define amdgpu_ps <2 x i32> @s_vselect_v4bf16(<4 x bfloat> inreg %a, <4 x bfloat>
 ; GFX8-NEXT:    v_mov_b32_e32 v3, s3
 ; GFX8-NEXT:    v_mov_b32_e32 v4, s1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v3, v4, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v3, v4, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX8-NEXT:    v_mov_b32_e32 v3, s2
 ; GFX8-NEXT:    v_mov_b32_e32 v4, s0
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX8-NEXT:    v_cndmask_b32_e32 v0, v3, v4, vcc
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
@@ -40769,24 +40758,18 @@ define <4 x bfloat> @v_vselect_v4bf16(<4 x i1> %cond, <4 x bfloat> %a, <4 x bflo
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    v_and_b32_e32 v3, 1, v3
 ; GFX8-NEXT:    v_and_b32_e32 v2, 1, v2
-; GFX8-NEXT:    v_lshrrev_b32_e32 v8, 16, v5
-; GFX8-NEXT:    v_lshrrev_b32_e32 v9, 16, v7
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
 ; GFX8-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v3, v9, v8, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v3, v7, v5, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
 ; GFX8-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX8-NEXT:    v_cndmask_b32_e32 v2, v7, v5, vcc
-; GFX8-NEXT:    v_lshrrev_b32_e32 v5, 16, v4
-; GFX8-NEXT:    v_lshrrev_b32_e32 v7, 16, v6
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v7, v5, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v6, v4, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX8-NEXT:    v_cndmask_b32_e32 v0, v6, v4, vcc
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v3
-; GFX8-NEXT:    v_or_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: v_vselect_v4bf16:
@@ -40797,17 +40780,13 @@ define <4 x bfloat> @v_vselect_v4bf16(<4 x i1> %cond, <4 x bfloat> %a, <4 x bflo
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
 ; GFX9-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v5, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v5, 16, v5
-; GFX9-NEXT:    v_lshrrev_b32_e32 v7, 16, v7
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
 ; GFX9-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v7, v5, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v3, v7, v5, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v6, v4, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v4, 16, v4
-; GFX9-NEXT:    v_lshrrev_b32_e32 v5, 16, v6
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v5, v4, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v1, v6, v4, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    s_mov_b32 s4, 0x5040100
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s4
 ; GFX9-NEXT:    v_perm_b32 v1, v3, v2, s4
@@ -40996,44 +40975,32 @@ define <8 x bfloat> @v_vselect_v8bf16(<8 x i1> %cond, <8 x bfloat> %a, <8 x bflo
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    v_and_b32_e32 v7, 1, v7
 ; GFX8-NEXT:    v_and_b32_e32 v6, 1, v6
-; GFX8-NEXT:    v_lshrrev_b32_e32 v16, 16, v11
-; GFX8-NEXT:    v_lshrrev_b32_e32 v17, 16, v15
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v7
 ; GFX8-NEXT:    v_and_b32_e32 v5, 1, v5
-; GFX8-NEXT:    v_cndmask_b32_e32 v7, v17, v16, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v7, v15, v11, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v6
 ; GFX8-NEXT:    v_and_b32_e32 v4, 1, v4
 ; GFX8-NEXT:    v_cndmask_b32_e32 v6, v15, v11, vcc
-; GFX8-NEXT:    v_lshrrev_b32_e32 v11, 16, v10
-; GFX8-NEXT:    v_lshrrev_b32_e32 v15, 16, v14
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v5
 ; GFX8-NEXT:    v_and_b32_e32 v3, 1, v3
-; GFX8-NEXT:    v_cndmask_b32_e32 v5, v15, v11, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v5, v14, v10, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v4
 ; GFX8-NEXT:    v_and_b32_e32 v2, 1, v2
 ; GFX8-NEXT:    v_cndmask_b32_e32 v4, v14, v10, vcc
-; GFX8-NEXT:    v_lshrrev_b32_e32 v10, 16, v9
-; GFX8-NEXT:    v_lshrrev_b32_e32 v11, 16, v13
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
 ; GFX8-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v3, v11, v10, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v3, v13, v9, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
 ; GFX8-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX8-NEXT:    v_cndmask_b32_e32 v2, v13, v9, vcc
-; GFX8-NEXT:    v_lshrrev_b32_e32 v9, 16, v8
-; GFX8-NEXT:    v_lshrrev_b32_e32 v10, 16, v12
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX8-NEXT:    v_cndmask_b32_e32 v1, v10, v9, vcc
+; GFX8-NEXT:    v_cndmask_b32_sdwa v1, v12, v8, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX8-NEXT:    v_cndmask_b32_e32 v0, v12, v8, vcc
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX8-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_lshlrev_b32_e32 v1, 16, v3
-; GFX8-NEXT:    v_or_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_lshlrev_b32_e32 v2, 16, v5
-; GFX8-NEXT:    v_lshlrev_b32_e32 v3, 16, v7
-; GFX8-NEXT:    v_or_b32_sdwa v2, v4, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT:    v_or_b32_sdwa v3, v6, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v2, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT:    v_or_b32_sdwa v3, v6, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: v_vselect_v8bf16:
@@ -41044,33 +41011,25 @@ define <8 x bfloat> @v_vselect_v8bf16(<8 x i1> %cond, <8 x bfloat> %a, <8 x bflo
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v6
 ; GFX9-NEXT:    v_and_b32_e32 v4, 1, v4
 ; GFX9-NEXT:    v_cndmask_b32_e32 v6, v15, v11, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v11, 16, v11
-; GFX9-NEXT:    v_lshrrev_b32_e32 v15, 16, v15
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v7
 ; GFX9-NEXT:    v_and_b32_e32 v5, 1, v5
-; GFX9-NEXT:    v_cndmask_b32_e32 v7, v15, v11, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v7, v15, v11, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v4
 ; GFX9-NEXT:    v_and_b32_e32 v2, 1, v2
 ; GFX9-NEXT:    v_cndmask_b32_e32 v4, v14, v10, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v10, 16, v10
-; GFX9-NEXT:    v_lshrrev_b32_e32 v11, 16, v14
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v5
 ; GFX9-NEXT:    v_and_b32_e32 v3, 1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v5, v11, v10, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v5, v14, v10, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
 ; GFX9-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v2, v13, v9, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v9, 16, v9
-; GFX9-NEXT:    v_lshrrev_b32_e32 v10, 16, v13
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
 ; GFX9-NEXT:    v_and_b32_e32 v1, 1, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v10, v9, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v3, v13, v9, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v12, v8, vcc
-; GFX9-NEXT:    v_lshrrev_b32_e32 v8, 16, v8
-; GFX9-NEXT:    v_lshrrev_b32_e32 v9, 16, v12
 ; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v9, v8, vcc
+; GFX9-NEXT:    v_cndmask_b32_sdwa v1, v12, v8, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX9-NEXT:    s_mov_b32 s4, 0x5040100
 ; GFX9-NEXT:    v_perm_b32 v0, v1, v0, s4
 ; GFX9-NEXT:    v_perm_b32 v1, v3, v2, s4
@@ -41466,168 +41425,128 @@ define <16 x bfloat> @v_vselect_v16bf16(<16 x i1> %cond, <16 x bfloat> %a, <16 x
 ; GFX8-LABEL: v_vselect_v16bf16:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT:    v_and_b32_e32 v15, 1, v15
+; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v15
+; GFX8-...
[truncated]
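The added .mir tests are truncated above. As a rough, hypothetical sketch of the kind of pre-RA input the pass now handles (the function name and virtual register numbers are illustrative, not the actual test content):

---
name: cndmask_carry_to_vcc
tracksRegLiveness: true
body: |
  bb.0:
    liveins: $vgpr0, $vgpr1, $vgpr2
    %0:vgpr_32 = COPY $vgpr0
    %1:vgpr_32 = COPY $vgpr1
    %2:vgpr_32 = COPY $vgpr2
    ; the carry-in lives in a virtual SGPR pair, produced by a VOP3 compare
    %3:sreg_64_xexec = V_CMP_EQ_U32_e64 1, %0, implicit $exec
    ; VOP3 select: SDWA conversion first requires the carry in $vcc
    %4:vgpr_32 = V_CNDMASK_B32_e64 0, %1, 0, %2, %3, implicit $exec
    ; the shift by 16 is the SDWA opportunity (a WORD_1 extract)
    %5:vgpr_32 = V_LSHRREV_B32_e32 16, %4, implicit $exec
    S_ENDPGM 0
...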

@frederik-h requested a review from arsenm April 30, 2025 13:07

github-actions bot commented May 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@frederik-h requested a review from arsenm May 2, 2025 12:46
frederik-h and others added 4 commits May 2, 2025 16:46
- Compact reg numbers in vop test
- Remove "undef"
- Readjust types in wave32 test
@frederik-h requested a review from arsenm May 5, 2025 09:10
@arsenm (Contributor) left a comment:

lgtm except there are still tests using undef flags on operands which really have defs

frederik-h and others added 2 commits May 5, 2025 13:22
frederik-h added 2 commits May 5, 2025 08:56
- Move VOP2 test cases into VOP3 wave32, wave64 test files
- Adjust test names and comments to reflect that we no longer attempt to change the carry-in operands to VOP2.
... since those files now also cover the vop2 case.
@frederik-h requested a review from arsenm May 5, 2025 14:14
@frederik-h merged commit 721cba4 into llvm:main May 5, 2025
11 checks passed
@frederik-h (Contributor, Author) commented:

@arsenm Thanks for the very helpful review!

@frederik-h deleted the sdwa-peephole-vcndmask branch May 5, 2025 16:14
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in
operand. The conversion to SDWA implies a conversion to VOP2 form
which reads from VCC instead.

Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA
to V_CNDMASK_B32_e32 first and introduce a copy of the carry-in operand
to VCC.

Closes llvm#133431.

---------

Co-authored-by: Matt Arsenault <[email protected]>
IanWood1 pushed two more commits to IanWood1/llvm-project that referenced this pull request May 6, 2025 (same message as above).
IanWood1 added a commit to iree-org/llvm-project that referenced this pull request May 6, 2025
IanWood1 added a commit to IanWood1/iree that referenced this pull request May 6, 2025
Reverts llvm/llvm-project#137930 and
llvm/llvm-project@e1cff21
to fix failure of e2e_matmul_cdna3_pad_i8_rocm_hip.

Signed-off-by: Ian Wood <[email protected]>
IanWood1 added a commit to iree-org/iree that referenced this pull request May 6, 2025
- Carries the same 4 reverts from
#20674.
- Uses enum when building `LLVM::GEPOp` (llvm/llvm-project#137272)
- Reverts
llvm/llvm-project@7318074
which can be undone after torch-mlir and stablehlo have been updated.
- Reverts llvm/llvm-project#137930 and
llvm/llvm-project@e1cff21
to fix correctness failure of e2e_matmul_cdna3_pad_i8_rocm_hip.

---------

Signed-off-by: Ian Wood <[email protected]>
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this pull request May 7, 2025

Successfully merging this pull request may close these issues:

  • [AMDGPU] SIPeepholeSDWA: Add V_CNDMASK_B32 support (#133431)