[AMDGPU] SIPeepholeSDWA: Handle V_CNDMASK_B32_e64 #137930
Conversation
…K_B32_e32 The problem with V_CNDMASK_B32_e64 hinted at by the comment (namely, that conversion to the VOP2 SDWA form introduces an implicit VCC use) does not exist for V_CNDMASK_B32_e32. Hence the latter should already be acceptable for conversion to SDWA without further ado.
The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in operand. The conversion to SDWA implies a conversion to VOP2 form which reads from VCC instead. Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA to V_CNDMASK_B32_e32 first and either change the instruction that defines the carry-in operand to write to VCC if this is possible or introduce a write of the carry-in operand to VCC.
@llvm/pr-subscribers-backend-amdgpu

Author: Frederik Harwath (frederik-h)

Changes

The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in operand. The conversion to SDWA implies a conversion to VOP2 form which reads from VCC instead.

Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA to V_CNDMASK_B32_e32 first and either change the instruction that defines the carry-in operand to write to VCC if this is possible or introduce a write of the carry-in operand to VCC.

Closes #133431.

Patch is 470.84 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137930.diff

26 Files Affected:
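As an illustration of the intended rewrite, here is a hand-written sketch of the two strategies in GFX9-style assembly; the register numbers and the v_cmp_lt compare are made up for this example and are not taken from the patch or its tests. Recall that v_cndmask_b32 selects src1 for lanes whose mask bit is set and src0 otherwise.

Case 1: the carry-in is defined by a single-use VOP3 compare, so the compare can simply be redirected to write VCC:

; before
v_cmp_lt_u32_e64 s[4:5], v0, v1
v_cndmask_b32_e64 v2, v3, v4, s[4:5]
; after
v_cmp_lt_u32_e64 vcc, v0, v1
v_cndmask_b32_e32 v2, v3, v4, vcc

Case 2: the carry-in definition cannot be changed, so a v_cmp_eq write of the carry-in to VCC is inserted instead (this is the pattern that shows up as v_cmp_eq_u64_e64 in the test diffs below):

; before
v_cndmask_b32_e64 v2, v3, v4, s[4:5]
; after
v_cmp_eq_u64_e64 vcc, 1, s[4:5]
v_cndmask_b32_e32 v2, v3, v4, vcc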
diff --git a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
index 22f23e4c94e2d..f5f808623cc0c 100644
--- a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
@@ -62,6 +62,7 @@ class SIPeepholeSDWA {
std::unique_ptr<SDWAOperand> matchSDWAOperand(MachineInstr &MI);
void pseudoOpConvertToVOP2(MachineInstr &MI,
const GCNSubtarget &ST) const;
+ void convertToImplicitVcc(MachineInstr &MI, const GCNSubtarget &ST) const;
MachineInstr *createSDWAVersion(MachineInstr &MI);
bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);
void legalizeScalarOperands(MachineInstr &MI, const GCNSubtarget &ST) const;
@@ -1061,6 +1062,79 @@ void SIPeepholeSDWA::pseudoOpConvertToVOP2(MachineInstr &MI,
MISucc.substituteRegister(CarryIn->getReg(), TRI->getVCC(), 0, *TRI);
}
+static unsigned getVCmpEqOpcode(unsigned Bits) {
+ if (Bits == 64)
+ return AMDGPU::V_CMP_EQ_U64_e64;
+ if (Bits == 32)
+ return AMDGPU::V_CMP_EQ_U32_e64;
+ if (Bits == 16)
+ return AMDGPU::V_CMP_EQ_U16_e64;
+
+ llvm_unreachable("Unexpected register bit width.");
+}
+
+/// Try to convert \p MI, a VOP3 instruction that takes a src2 carry-in
+/// operand, into the corresponding VOP2 form, which expects the
+/// argument in VCC. To this end, either try to change the definition
+/// of the carry-in operand to write to VCC or add an instruction that
+/// copies from the carry-in to VCC. The conversion will only be
+/// applied if \p MI can be shrunk to VOP2 and if VCC can be proven to
+/// be dead before \p MI.
+void SIPeepholeSDWA::convertToImplicitVcc(MachineInstr &MI,
+ const GCNSubtarget &ST) const {
+ assert(MI.getOpcode() == AMDGPU::V_CNDMASK_B32_e64);
+
+ MCRegister Vcc = TRI->getVCC();
+ // FIXME: Conversion introduces an implicit vcc_hi use
+ if (Vcc == AMDGPU::VCC_LO)
+ return;
+
+ LLVM_DEBUG(dbgs() << "Attempting VOP2 conversion: " << MI);
+ if (!TII->canShrink(MI, *MRI)) {
+ LLVM_DEBUG(dbgs() << "Cannot shrink instruction\n");
+ return;
+ }
+
+ const MachineOperand &CarryIn =
+ *TII->getNamedOperand(MI, AMDGPU::OpName::src2);
+
+ // Make sure VCC or its subregs are dead before MI.
+ MachineBasicBlock &MBB = *MI.getParent();
+ auto Liveness = MBB.computeRegisterLiveness(TRI, Vcc, MI, 100);
+ if (Liveness != MachineBasicBlock::LQR_Dead) {
+ LLVM_DEBUG(dbgs() << "VCC not known to be dead before instruction.\n");
+ return;
+ }
+ // Change destination of compare instruction to VCC
+ // or copy to VCC if carry-in is not a compare inst.
+ Register CarryReg = CarryIn.getReg();
+ MachineInstr &CarryDef = *MRI->getVRegDef(CarryReg);
+
+ if (CarryDef.isCompare() && TII->isVOP3(CarryDef) &&
+ MRI->hasOneUse(CarryIn.getReg())) {
+ CarryDef.substituteRegister(CarryIn.getReg(), Vcc, 0, *TRI);
+ CarryDef.moveBefore(&MI);
+ } else {
+ // Add write: VCC[laneId] <- (CarryIn[laneId] == 1)
+ const TargetRegisterClass *Class =
+ TRI->getRegClassForOperandReg(*MRI, CarryIn);
+ unsigned RegSize = Class->MC->getSizeInBits();
+ BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(getVCmpEqOpcode(RegSize)))
+ .addReg(Vcc, RegState::Define)
+ .addImm(1)
+ .add(CarryIn);
+ }
+
+ auto Converted = BuildMI(MBB, MI, MI.getDebugLoc(),
+ TII->get(AMDGPU::getVOPe32(MI.getOpcode())))
+ .add(*TII->getNamedOperand(MI, AMDGPU::OpName::vdst))
+ .add(*TII->getNamedOperand(MI, AMDGPU::OpName::src0))
+ .add(*TII->getNamedOperand(MI, AMDGPU::OpName::src1))
+ .setMIFlags(MI.getFlags());
+ LLVM_DEBUG(dbgs() << "Converted to VOP2: " << *Converted << '\n');
+ MI.eraseFromParent();
+}
+
namespace {
bool isConvertibleToSDWA(MachineInstr &MI,
const GCNSubtarget &ST,
@@ -1070,6 +1144,11 @@ bool isConvertibleToSDWA(MachineInstr &MI,
if (TII->isSDWA(Opc))
return true;
+ // Can only be handled after earlier conversion to
+ // AMDGPU::V_CNDMASK_B32_e32, which is not always possible.
+ if (Opc == AMDGPU::V_CNDMASK_B32_e64)
+ return false;
+
// Check if this instruction has opcode that supports SDWA
if (AMDGPU::getSDWAOp(Opc) == -1)
Opc = AMDGPU::getVOPe32(Opc);
@@ -1108,10 +1187,6 @@ bool isConvertibleToSDWA(MachineInstr &MI,
if (TII->pseudoToMCOpcode(Opc) == -1)
return false;
- // FIXME: has SDWA but require handling of implicit VCC use
- if (Opc == AMDGPU::V_CNDMASK_B32_e32)
- return false;
-
if (MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0)) {
if (!Src0->isReg() && !Src0->isImm())
return false;
@@ -1384,10 +1459,18 @@ bool SIPeepholeSDWA::run(MachineFunction &MF) {
for (const auto &OperandPair : SDWAOperands) {
const auto &Operand = OperandPair.second;
MachineInstr *PotentialMI = Operand->potentialToConvert(TII, ST);
- if (PotentialMI &&
- (PotentialMI->getOpcode() == AMDGPU::V_ADD_CO_U32_e64 ||
- PotentialMI->getOpcode() == AMDGPU::V_SUB_CO_U32_e64))
+ if (!PotentialMI)
+ continue;
+
+ switch (PotentialMI->getOpcode()) {
+ case AMDGPU::V_ADD_CO_U32_e64:
+ case AMDGPU::V_SUB_CO_U32_e64:
pseudoOpConvertToVOP2(*PotentialMI, ST);
+ break;
+ case AMDGPU::V_CNDMASK_B32_e64:
+ convertToImplicitVcc(*PotentialMI, ST);
+ break;
+ }
}
SDWAOperands.clear();
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index 19b6ff68b9869..e172bf090cca7 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -38481,10 +38481,8 @@ define <2 x bfloat> @v_select_v2bf16(i1 %cond, <2 x bfloat> %a, <2 x bfloat> %b)
; GFX8-NEXT: v_and_b32_e32 v0, 1, v0
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX8-NEXT: v_cndmask_b32_e32 v0, v2, v1, vcc
-; GFX8-NEXT: v_lshrrev_b32_e32 v1, 16, v1
-; GFX8-NEXT: v_lshrrev_b32_e32 v2, 16, v2
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v2, v1, vcc
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1
+; GFX8-NEXT: v_cmp_eq_u64_e64 vcc, 1, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v2, v1, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
@@ -38494,10 +38492,9 @@ define <2 x bfloat> @v_select_v2bf16(i1 %cond, <2 x bfloat> %a, <2 x bfloat> %b)
; GFX9-NEXT: v_and_b32_e32 v0, 1, v0
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX9-NEXT: v_cndmask_b32_e32 v0, v2, v1, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v1, 16, v1
-; GFX9-NEXT: v_lshrrev_b32_e32 v2, 16, v2
-; GFX9-NEXT: v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT: v_cmp_eq_u64_e64 vcc, 1, vcc
; GFX9-NEXT: s_mov_b32 s4, 0x5040100
+; GFX9-NEXT: v_cndmask_b32_sdwa v1, v2, v1, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: v_perm_b32 v0, v1, v0, s4
; GFX9-NEXT: s_setpc_b64 s[30:31]
;
@@ -38581,11 +38578,8 @@ define <2 x bfloat> @v_vselect_v2bf16(<2 x i1> %cond, <2 x bfloat> %a, <2 x bflo
; GFX8-NEXT: v_and_b32_e32 v1, 1, v1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX8-NEXT: v_cndmask_b32_e32 v0, v3, v2, vcc
-; GFX8-NEXT: v_lshrrev_b32_e32 v2, 16, v2
-; GFX8-NEXT: v_lshrrev_b32_e32 v3, 16, v3
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v3, v2, vcc
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v3, v2, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
@@ -38596,10 +38590,8 @@ define <2 x bfloat> @v_vselect_v2bf16(<2 x i1> %cond, <2 x bfloat> %a, <2 x bflo
; GFX9-NEXT: v_and_b32_e32 v1, 1, v1
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX9-NEXT: v_cndmask_b32_e32 v0, v3, v2, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v2, 16, v2
-; GFX9-NEXT: v_lshrrev_b32_e32 v3, 16, v3
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT: v_cndmask_b32_e32 v1, v3, v2, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v1, v3, v2, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: s_mov_b32 s4, 0x5040100
; GFX9-NEXT: v_perm_b32 v0, v1, v0, s4
; GFX9-NEXT: s_setpc_b64 s[30:31]
@@ -38767,17 +38759,17 @@ define amdgpu_ps i32 @s_select_v2bf16(<2 x bfloat> inreg %a, <2 x bfloat> inreg
;
; GFX8-LABEL: s_select_v2bf16:
; GFX8: ; %bb.0:
+; GFX8-NEXT: v_mov_b32_e32 v2, s1
+; GFX8-NEXT: v_mov_b32_e32 v3, s0
+; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
; GFX8-NEXT: s_lshr_b32 s2, s0, 16
; GFX8-NEXT: s_lshr_b32 s3, s1, 16
+; GFX8-NEXT: v_cndmask_b32_e32 v0, v2, v3, vcc
+; GFX8-NEXT: v_cmp_eq_u64_e64 vcc, 1, vcc
; GFX8-NEXT: v_mov_b32_e32 v1, s3
; GFX8-NEXT: v_mov_b32_e32 v2, s2
-; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX8-NEXT: v_cndmask_b32_e32 v0, v1, v2, vcc
-; GFX8-NEXT: v_mov_b32_e32 v1, s1
-; GFX8-NEXT: v_mov_b32_e32 v2, s0
-; GFX8-NEXT: v_lshlrev_b32_e32 v0, 16, v0
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v1, v2, vcc
-; GFX8-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v1, v2, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: v_readfirstlane_b32 s0, v0
; GFX8-NEXT: ; return to shader part epilog
;
@@ -38885,11 +38877,10 @@ define amdgpu_ps i32 @s_vselect_v2bf16(<2 x bfloat> inreg %a, <2 x bfloat> inreg
; GFX8-NEXT: v_mov_b32_e32 v2, s3
; GFX8-NEXT: v_mov_b32_e32 v3, s2
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v2, v3, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v2, v3, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
; GFX8-NEXT: v_mov_b32_e32 v2, s1
; GFX8-NEXT: v_mov_b32_e32 v3, s0
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX8-NEXT: v_cndmask_b32_e32 v0, v2, v3, vcc
; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: v_readfirstlane_b32 s0, v0
@@ -40567,11 +40558,10 @@ define amdgpu_ps <2 x i32> @s_vselect_v4bf16(<4 x bfloat> inreg %a, <4 x bfloat>
; GFX8-NEXT: v_mov_b32_e32 v4, s5
; GFX8-NEXT: v_mov_b32_e32 v5, s4
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v3
-; GFX8-NEXT: v_cndmask_b32_e32 v3, v4, v5, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v3, v4, v5, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
; GFX8-NEXT: v_mov_b32_e32 v4, s3
; GFX8-NEXT: v_mov_b32_e32 v5, s1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
-; GFX8-NEXT: v_lshlrev_b32_e32 v3, 16, v3
; GFX8-NEXT: v_cndmask_b32_e32 v2, v4, v5, vcc
; GFX8-NEXT: s_lshr_b32 s1, s0, 16
; GFX8-NEXT: s_lshr_b32 s3, s2, 16
@@ -40579,11 +40569,10 @@ define amdgpu_ps <2 x i32> @s_vselect_v4bf16(<4 x bfloat> inreg %a, <4 x bfloat>
; GFX8-NEXT: v_mov_b32_e32 v3, s3
; GFX8-NEXT: v_mov_b32_e32 v4, s1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v3, v4, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v3, v4, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
; GFX8-NEXT: v_mov_b32_e32 v3, s2
; GFX8-NEXT: v_mov_b32_e32 v4, s0
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX8-NEXT: v_cndmask_b32_e32 v0, v3, v4, vcc
; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: v_readfirstlane_b32 s0, v0
@@ -40769,24 +40758,18 @@ define <4 x bfloat> @v_vselect_v4bf16(<4 x i1> %cond, <4 x bfloat> %a, <4 x bflo
; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX8-NEXT: v_and_b32_e32 v3, 1, v3
; GFX8-NEXT: v_and_b32_e32 v2, 1, v2
-; GFX8-NEXT: v_lshrrev_b32_e32 v8, 16, v5
-; GFX8-NEXT: v_lshrrev_b32_e32 v9, 16, v7
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v3
; GFX8-NEXT: v_and_b32_e32 v1, 1, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v3, v9, v8, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v3, v7, v5, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v2
; GFX8-NEXT: v_and_b32_e32 v0, 1, v0
; GFX8-NEXT: v_cndmask_b32_e32 v2, v7, v5, vcc
-; GFX8-NEXT: v_lshrrev_b32_e32 v5, 16, v4
-; GFX8-NEXT: v_lshrrev_b32_e32 v7, 16, v6
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v7, v5, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v6, v4, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX8-NEXT: v_cndmask_b32_e32 v0, v6, v4, vcc
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v3
-; GFX8-NEXT: v_or_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT: v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
; GFX9-LABEL: v_vselect_v4bf16:
@@ -40797,17 +40780,13 @@ define <4 x bfloat> @v_vselect_v4bf16(<4 x i1> %cond, <4 x bfloat> %a, <4 x bflo
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v2
; GFX9-NEXT: v_and_b32_e32 v0, 1, v0
; GFX9-NEXT: v_cndmask_b32_e32 v2, v7, v5, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v5, 16, v5
-; GFX9-NEXT: v_lshrrev_b32_e32 v7, 16, v7
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v3
; GFX9-NEXT: v_and_b32_e32 v1, 1, v1
-; GFX9-NEXT: v_cndmask_b32_e32 v3, v7, v5, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v3, v7, v5, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX9-NEXT: v_cndmask_b32_e32 v0, v6, v4, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v4, 16, v4
-; GFX9-NEXT: v_lshrrev_b32_e32 v5, 16, v6
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT: v_cndmask_b32_e32 v1, v5, v4, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v1, v6, v4, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: s_mov_b32 s4, 0x5040100
; GFX9-NEXT: v_perm_b32 v0, v1, v0, s4
; GFX9-NEXT: v_perm_b32 v1, v3, v2, s4
@@ -40996,44 +40975,32 @@ define <8 x bfloat> @v_vselect_v8bf16(<8 x i1> %cond, <8 x bfloat> %a, <8 x bflo
; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX8-NEXT: v_and_b32_e32 v7, 1, v7
; GFX8-NEXT: v_and_b32_e32 v6, 1, v6
-; GFX8-NEXT: v_lshrrev_b32_e32 v16, 16, v11
-; GFX8-NEXT: v_lshrrev_b32_e32 v17, 16, v15
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v7
; GFX8-NEXT: v_and_b32_e32 v5, 1, v5
-; GFX8-NEXT: v_cndmask_b32_e32 v7, v17, v16, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v7, v15, v11, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v6
; GFX8-NEXT: v_and_b32_e32 v4, 1, v4
; GFX8-NEXT: v_cndmask_b32_e32 v6, v15, v11, vcc
-; GFX8-NEXT: v_lshrrev_b32_e32 v11, 16, v10
-; GFX8-NEXT: v_lshrrev_b32_e32 v15, 16, v14
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v5
; GFX8-NEXT: v_and_b32_e32 v3, 1, v3
-; GFX8-NEXT: v_cndmask_b32_e32 v5, v15, v11, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v5, v14, v10, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v4
; GFX8-NEXT: v_and_b32_e32 v2, 1, v2
; GFX8-NEXT: v_cndmask_b32_e32 v4, v14, v10, vcc
-; GFX8-NEXT: v_lshrrev_b32_e32 v10, 16, v9
-; GFX8-NEXT: v_lshrrev_b32_e32 v11, 16, v13
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v3
; GFX8-NEXT: v_and_b32_e32 v1, 1, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v3, v11, v10, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v3, v13, v9, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v2
; GFX8-NEXT: v_and_b32_e32 v0, 1, v0
; GFX8-NEXT: v_cndmask_b32_e32 v2, v13, v9, vcc
-; GFX8-NEXT: v_lshrrev_b32_e32 v9, 16, v8
-; GFX8-NEXT: v_lshrrev_b32_e32 v10, 16, v12
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX8-NEXT: v_cndmask_b32_e32 v1, v10, v9, vcc
+; GFX8-NEXT: v_cndmask_b32_sdwa v1, v12, v8, vcc dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX8-NEXT: v_cndmask_b32_e32 v0, v12, v8, vcc
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v3
-; GFX8-NEXT: v_or_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT: v_lshlrev_b32_e32 v2, 16, v5
-; GFX8-NEXT: v_lshlrev_b32_e32 v3, 16, v7
-; GFX8-NEXT: v_or_b32_sdwa v2, v4, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
-; GFX8-NEXT: v_or_b32_sdwa v3, v6, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT: v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT: v_or_b32_sdwa v2, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
+; GFX8-NEXT: v_or_b32_sdwa v3, v6, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
; GFX9-LABEL: v_vselect_v8bf16:
@@ -41044,33 +41011,25 @@ define <8 x bfloat> @v_vselect_v8bf16(<8 x i1> %cond, <8 x bfloat> %a, <8 x bflo
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v6
; GFX9-NEXT: v_and_b32_e32 v4, 1, v4
; GFX9-NEXT: v_cndmask_b32_e32 v6, v15, v11, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v11, 16, v11
-; GFX9-NEXT: v_lshrrev_b32_e32 v15, 16, v15
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v7
; GFX9-NEXT: v_and_b32_e32 v5, 1, v5
-; GFX9-NEXT: v_cndmask_b32_e32 v7, v15, v11, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v7, v15, v11, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v4
; GFX9-NEXT: v_and_b32_e32 v2, 1, v2
; GFX9-NEXT: v_cndmask_b32_e32 v4, v14, v10, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v10, 16, v10
-; GFX9-NEXT: v_lshrrev_b32_e32 v11, 16, v14
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v5
; GFX9-NEXT: v_and_b32_e32 v3, 1, v3
-; GFX9-NEXT: v_cndmask_b32_e32 v5, v11, v10, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v5, v14, v10, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v2
; GFX9-NEXT: v_and_b32_e32 v0, 1, v0
; GFX9-NEXT: v_cndmask_b32_e32 v2, v13, v9, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v9, 16, v9
-; GFX9-NEXT: v_lshrrev_b32_e32 v10, 16, v13
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v3
; GFX9-NEXT: v_and_b32_e32 v1, 1, v1
-; GFX9-NEXT: v_cndmask_b32_e32 v3, v10, v9, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v3, v13, v9, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
; GFX9-NEXT: v_cndmask_b32_e32 v0, v12, v8, vcc
-; GFX9-NEXT: v_lshrrev_b32_e32 v8, 16, v8
-; GFX9-NEXT: v_lshrrev_b32_e32 v9, 16, v12
; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
-; GFX9-NEXT: v_cndmask_b32_e32 v1, v9, v8, vcc
+; GFX9-NEXT: v_cndmask_b32_sdwa v1, v12, v8, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX9-NEXT: s_mov_b32 s4, 0x5040100
; GFX9-NEXT: v_perm_b32 v0, v1, v0, s4
; GFX9-NEXT: v_perm_b32 v1, v3, v2, s4
@@ -41466,168 +41425,128 @@ define <16 x bfloat> @v_vselect_v16bf16(<16 x i1> %cond, <16 x bfloat> %a, <16 x
; GFX8-LABEL: v_vselect_v16bf16:
; GFX8: ; %bb.0:
; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT: v_and_b32_e32 v15, 1, v15
+; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v15
+; GFX8-...
[truncated]
- Don't use auto
- Adapt other use for consistency
- Use default threshold
- Adjust tests
Co-authored-by: Matt Arsenault <[email protected]>
✅ With the latest revision this PR passed the C/C++ code formatter.
Further changes:
- Add debug output for missing carry-in def.
Co-authored-by: Matt Arsenault <[email protected]>
- Compact reg numbers in vop test
- Remove "undef"
- Readjust types in wave32 test
lgtm except there are still tests using undef flags on operands which really have defs
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: Matt Arsenault <[email protected]>
- Move VOP2 test cases into the VOP3 wave32 and wave64 test files
- Adjust test names and comments to reflect that we no longer attempt to change the carry-in operands to VOP2.
... since those files now also cover the vop2 case.
@arsenm Thanks for the very helpful review!
This reverts commit 721cba4.

Signed-off-by: Ian Wood <[email protected]>
Reverts llvm/llvm-project#137930 and llvm/llvm-project@e1cff21 to fix the failure of e2e_matmul_cdna3_pad_i8_rocm_hip.

Signed-off-by: Ian Wood <[email protected]>
- Carries the same 4 reverts from #20674.
- Uses enum when building `LLVM::GEPOp` (llvm/llvm-project#137272)
- Reverts llvm/llvm-project@7318074 which can be undone after torch-mlir and stablehlo have been updated.
- Reverts llvm/llvm-project#137930 and llvm/llvm-project@e1cff21 to fix correctness failure of e2e_matmul_cdna3_pad_i8_rocm_hip.

Signed-off-by: Ian Wood <[email protected]>
The VOP3 form of the V_CNDMASK_B32 instruction takes a carry-in
operand. The conversion to SDWA implies a conversion to VOP2 form
which reads from VCC instead.
Convert V_CNDMASK_B32_e64 instructions that might be converted to SDWA
to V_CNDMASK_B32_e32 first and introduce a copy of the carry-in operand to VCC.
Closes #133431.