[AMDGPU][True16][CodeGen] readfirstlane for vgpr16 copy to sgpr32 #118037

broxigarchen · 2024-11-28T22:07:25Z

In true16 mode, isel lowering can select an i16 value into either an sgpr32 or a vgpr16. As a result, ext selection can produce copies from vgpr16 to sgpr32, which seems unavoidable without sgpr16 support.

Legalize the src/dst registers when the fix-sgpr-copy pass decides to lower this special copy to a readfirstlane, and add a lit test.

llvmbot · 2024-11-28T22:08:05Z

@llvm/pr-subscribers-backend-amdgpu

Author: Brox Chen (broxigarchen)

Changes

V2S COPY can be emitted as either

sgpr_32 = COPY vgpr_16
or
sgpr_lo16 = COPY vgpr_16

For a 16-bit source, emit a REG_SEQUENCE with the hi16 bits undef and use it as the readfirstlane operand.
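To make the data flow of this lowering concrete, here is a minimal standalone C++ model. It is not LLVM code and is only a sketch under stated assumptions: a 32-lane wave, and the helper names `regSequenceLoHi` and `readFirstLane` are hypothetical. The undefined hi16 half is modeled by an arbitrary filler value chosen by the caller, which shows that the low 16 bits of the scalar result do not depend on it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption for illustration: a wave of 32 lanes.
constexpr int kWaveSize = 32;
using Vgpr16 = std::array<uint16_t, kWaveSize>;
using Vgpr32 = std::array<uint32_t, kWaveSize>;

// Models REG_SEQUENCE src, lo16, undef, hi16.
// The hi16 half is undefined in the real lowering; hiFiller stands in for
// "whatever bits happen to be there".
Vgpr32 regSequenceLoHi(const Vgpr16 &lo, uint16_t hiFiller) {
  Vgpr32 out{};
  for (int lane = 0; lane < kWaveSize; ++lane)
    out[lane] = (static_cast<uint32_t>(hiFiller) << 16) | lo[lane];
  return out;
}

// Models V_READFIRSTLANE_B32: returns the value held by the lowest active
// lane in exec. This sketch falls back to lane 0 when no lane is active.
uint32_t readFirstLane(const Vgpr32 &src, uint32_t exec) {
  for (int lane = 0; lane < kWaveSize; ++lane)
    if (exec & (1u << lane))
      return src[lane];
  return src[0];
}
```

Because only the low half is defined, a consumer of the widened scalar has to ignore or mask the upper 16 bits, as the `S_AND_B32` with 255 does in the MIR test of this patch.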

Full diff: https://github.com/llvm/llvm-project/pull/118037.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp (+19-4)
(added) llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir (+118)

diff --git a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
index ac69bf6d038ece..9749d09592bab6 100644
--- a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
@@ -1075,10 +1075,25 @@ void SIFixSGPRCopies::lowerVGPR2SGPRCopies(MachineFunction &MF) {
         TRI->getRegClassForOperandReg(*MRI, MI->getOperand(1));
     size_t SrcSize = TRI->getRegSizeInBits(*SrcRC);
     if (SrcSize == 16) {
-      // HACK to handle possible 16bit VGPR source
-      auto MIB = BuildMI(*MBB, MI, MI->getDebugLoc(),
-                         TII->get(AMDGPU::V_READFIRSTLANE_B32), DstReg);
-      MIB.addReg(SrcReg, 0, AMDGPU::NoSubRegister);
+      assert(MF.getSubtarget<GCNSubtarget>().useRealTrue16Insts() &&
+             "We do not expect to see 16-bit copies from VGPR to SGPR unless "
+             "we have 16-bit VGPRs");
+      assert(MRI->getRegClass(DstReg) == &AMDGPU::SGPR_LO16RegClass ||
+             MRI->getRegClass(DstReg) == &AMDGPU::SReg_32RegClass);
+      // There is no V_READFIRSTLANE_B16, so widen the destination scalar
+      // value to 32 bits
+      MRI->setRegClass(DstReg, &AMDGPU::SGPR_32RegClass);
+      Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+      const DebugLoc &DL = MI->getDebugLoc();
+      Register Undef = MRI->createVirtualRegister(&AMDGPU::VGPR_16RegClass);
+      BuildMI(*MBB, MI, DL, TII->get(AMDGPU::IMPLICIT_DEF), Undef);
+      BuildMI(*MBB, MI, DL, TII->get(AMDGPU::REG_SEQUENCE), TmpReg)
+          .addReg(SrcReg, 0, SubReg)
+          .addImm(AMDGPU::lo16)
+          .addReg(Undef)
+          .addImm(AMDGPU::hi16);
+      BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32), DstReg)
+          .addReg(TmpReg);
     } else if (SrcSize == 32) {
       auto MIB = BuildMI(*MBB, MI, MI->getDebugLoc(),
                          TII->get(AMDGPU::V_READFIRSTLANE_B32), DstReg);
diff --git a/llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir b/llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir
new file mode 100644
index 00000000000000..640245b53b5c0a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir
@@ -0,0 +1,118 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 3
+# RUN: llc -march=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -run-pass=si-fix-sgpr-copies -verify-machineinstrs -o - %s | FileCheck %s
+
+# Ensure READFIRSTLANE is generated, and that its src is REG_SEQUENCE.
+
+---
+name:            test4
+tracksRegLiveness: true
+body:             |
+  ; CHECK-LABEL: name: test4
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:sgpr_128 = COPY undef %1:sgpr_128
+  ; CHECK-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:sgpr_128 = COPY undef %4:sgpr_128
+  ; CHECK-NEXT:   S_BRANCH %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[S_MOV_B32_]], %bb.0, %6, %bb.3
+  ; CHECK-NEXT:   [[PHI1:%[0-9]+]]:sreg_32 = PHI [[S_MOV_B32_]], %bb.0, %8, %bb.3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.3(0x40000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY [[PHI]]
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[PHI]]
+  ; CHECK-NEXT:   [[BUFFER_LOAD_USHORT_OFFEN:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN [[COPY3]], [[COPY]], [[S_MOV_B32_1]], 0, 0, 0, implicit $exec
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[PHI]]
+  ; CHECK-NEXT:   [[BUFFER_LOAD_USHORT_OFFEN1:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN [[COPY4]], [[COPY]], [[S_MOV_B32_1]], 2, 0, 0, implicit $exec
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:vgpr_16 = COPY [[BUFFER_LOAD_USHORT_OFFEN]].lo16
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:vgpr_16 = COPY [[BUFFER_LOAD_USHORT_OFFEN1]].lo16
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
+  ; CHECK-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 65535, implicit $exec
+  ; CHECK-NEXT:   [[V_AND_B32_e64_:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 killed [[V_MOV_B32_e32_]], [[COPY7]], implicit $exec
+  ; CHECK-NEXT:   [[V_LSHL_OR_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHL_OR_B32_e64 [[COPY8]], 16, killed [[V_AND_B32_e64_]], implicit $exec
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:sgpr_lo16 = COPY [[PHI1]].lo16
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:vgpr_16 = COPY [[COPY9]]
+  ; CHECK-NEXT:   [[V_SUB_NC_U16_t16_e64_:%[0-9]+]]:vgpr_16 = V_SUB_NC_U16_t16_e64 0, [[COPY10]], 0, killed [[COPY5]], 0, 0, implicit $exec
+  ; CHECK-NEXT:   [[DEF:%[0-9]+]]:vgpr_16 = IMPLICIT_DEF
+  ; CHECK-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vgpr_32 = REG_SEQUENCE [[V_SUB_NC_U16_t16_e64_]], %subreg.lo16, [[DEF]], %subreg.hi16
+  ; CHECK-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]], implicit $exec
+  ; CHECK-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 255
+  ; CHECK-NEXT:   [[S_AND_B32_:%[0-9]+]]:sreg_32 = S_AND_B32 killed [[V_READFIRSTLANE_B32_]], killed [[S_MOV_B32_2]], implicit-def dead $scc
+  ; CHECK-NEXT:   [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 12
+  ; CHECK-NEXT:   S_CMP_LT_I32 [[S_AND_B32_]], killed [[S_MOV_B32_3]], implicit-def $scc
+  ; CHECK-NEXT:   S_CBRANCH_SCC1 %bb.4, implicit $scc
+  ; CHECK-NEXT:   S_BRANCH %bb.3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.1(0x40000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[S_MOV_B32_4:%[0-9]+]]:sreg_32 = S_MOV_B32 -1
+  ; CHECK-NEXT:   [[DEF1:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+  ; CHECK-NEXT:   [[S_MOV_B32_5:%[0-9]+]]:sreg_32 = S_MOV_B32 18
+  ; CHECK-NEXT:   S_CMP_LT_I32 [[S_AND_B32_]], killed [[S_MOV_B32_5]], implicit-def $scc
+  ; CHECK-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit $scc
+  ; CHECK-NEXT:   S_BRANCH %bb.4
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   S_ENDPGM 0
+  bb.1:
+    successors: %bb.3(0x80000000); %bb.3(100.00%)
+
+    %1:sgpr_128 = COPY undef %150:sgpr_128
+    %131:sreg_32 = S_MOV_B32 0
+    %2:sgpr_128 = COPY undef %151:sgpr_128
+    S_BRANCH %bb.3
+
+  bb.3:
+    successors: %bb.4(0x80000000); %bb.4(100.00%)
+
+    %3:sreg_32 = PHI %131:sreg_32, %bb.1, %183, %bb.5
+    %4:sreg_32 = PHI %131:sreg_32, %bb.1, %182, %bb.5
+
+  bb.4:
+    successors: %bb.6(0x40000000), %bb.5(0x40000000); %bb.5(50.00%), %bb.6(50.00%)
+
+    %154:sreg_32 = S_MOV_B32 0
+    %156:vgpr_32 = COPY %3:sreg_32
+    %162:vgpr_32 = COPY %3:sreg_32
+    %161:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN %162:vgpr_32, %1:sgpr_128, %154:sreg_32, 0, 0, 0, implicit $exec
+    %164:vgpr_32 = COPY %3:sreg_32
+    %163:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN %164:vgpr_32, %1:sgpr_128, %154:sreg_32, 2, 0, 0, implicit $exec
+    %9:vgpr_16 = COPY %161.lo16:vgpr_32
+    %10:vgpr_16 = COPY %163.lo16:vgpr_32
+    %165:sreg_32 = COPY %9:vgpr_16
+    %166:sreg_32 = COPY %10:vgpr_16
+    %12:sreg_32 = S_PACK_LL_B32_B16 %165:sreg_32, %166:sreg_32
+    %167:sgpr_lo16 = COPY %4.lo16:sreg_32
+    %170:vgpr_16 = COPY %167:sgpr_lo16
+    %177:vgpr_16 = V_SUB_NC_U16_t16_e64 0, %170:vgpr_16, 0, killed %9:vgpr_16, 0, 0, implicit $exec
+    %179:sreg_32 = COPY killed %177:vgpr_16
+    %180:sreg_32 = S_MOV_B32 255
+    %13:sreg_32 = S_AND_B32 killed %179:sreg_32, killed %180:sreg_32, implicit-def dead $scc
+    %181:sreg_32 = S_MOV_B32 12
+    S_CMP_LT_I32 %13:sreg_32, killed %181:sreg_32, implicit-def $scc
+    S_CBRANCH_SCC1 %bb.6, implicit $scc
+    S_BRANCH %bb.5
+
+  bb.5:
+    successors: %bb.6(0x40000000), %bb.3(0x40000000); %bb.6(50.00%), %bb.3(50.00%)
+
+    %183:sreg_32 = S_MOV_B32 -1
+    %182:sreg_32 = IMPLICIT_DEF
+    %184:sreg_32 = S_MOV_B32 18
+    S_CMP_LT_I32 %13:sreg_32, killed %184:sreg_32, implicit-def $scc
+    S_CBRANCH_SCC1 %bb.3, implicit $scc
+    S_BRANCH %bb.6
+
+  bb.6:
+      S_ENDPGM 0
+
+...

llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir

arsenm · 2024-12-03T15:26:48Z

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

@@ -1075,10 +1075,25 @@ void SIFixSGPRCopies::lowerVGPR2SGPRCopies(MachineFunction &MF) {
        TRI->getRegClassForOperandReg(*MRI, MI->getOperand(1));
    size_t SrcSize = TRI->getRegSizeInBits(*SrcRC);
    if (SrcSize == 16) {


Are you reaching here in practice?

Hi Matt. I think we probably never hit this in the previous code, but with the true16 flow we should hit it.

Not exactly, we don't want SIFixSGPRCopies to need to handle anything

Hi Matt. Could you elaborate a bit here? Do you think this 16-bit handling should not appear in this pass? Thanks!

It does need to be handled here, but if you're seeing this in practice something else is likely missing

This code fixed the conformance test I was looking at, and as you say it should be handled here, so I think it should be merged.
Regarding an 'alternative fix', do you have any hypothesis what that would be? Is it a priority to find that?

It depends what it is. Either way that indicates we should have an end to end IR test for whatever this issue was as well

Updated this PR's description to answer this question.

llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir

arsenm · 2025-01-14T05:34:15Z

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

@@ -1075,10 +1075,25 @@ void SIFixSGPRCopies::lowerVGPR2SGPRCopies(MachineFunction &MF) {
        TRI->getRegClassForOperandReg(*MRI, MI->getOperand(1));
    size_t SrcSize = TRI->getRegSizeInBits(*SrcRC);
    if (SrcSize == 16) {


Not exactly, we don't want SIFixSGPRCopies to need to handle anything

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

broxigarchen · 2025-05-02T16:34:27Z

After investigation, we still need this patch for true16 mode. I updated and trimmed the patch and updated the lit test for review.

Sisyph

Thanks for adding the IR test, it helps a lot. I would describe the situation as moving a uniform 16 bit value in a VGPR into an SGPR, when we don't have a readfirstlane_b16 instruction or a 16 bit SGPR. LGTM, but please wait for @arsenm.
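The word "uniform" matters here: reading the first active lane is only a faithful copy when every active lane holds the same value. The following standalone C++ sketch illustrates that condition. It is a model only, assuming a 32-lane wave, and the names `isUniform16` and `scalarize16` are hypothetical, not LLVM APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption for illustration: a wave of 32 lanes.
constexpr int kLanes = 32;
using Lanes16 = std::array<uint16_t, kLanes>;

// Returns true when all active lanes (bits set in exec) hold the same value.
bool isUniform16(const Lanes16 &v, uint32_t exec) {
  bool seen = false;
  uint16_t first = 0;
  for (int lane = 0; lane < kLanes; ++lane) {
    if (!(exec & (1u << lane)))
      continue;
    if (!seen) {
      first = v[lane];
      seen = true;
    } else if (v[lane] != first) {
      return false;
    }
  }
  return true;
}

// Picks the value of the lowest active lane, as a readfirstlane-style copy
// would. This sketch falls back to lane 0 when no lane is active.
uint16_t scalarize16(const Lanes16 &v, uint32_t exec) {
  for (int lane = 0; lane < kLanes; ++lane)
    if (exec & (1u << lane))
      return v[lane];
  return v[0];
}
```

For a uniform input, `scalarize16` loses nothing. For a divergent input it silently keeps one lane's value, which is why this kind of copy is only valid for values known to be uniform.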

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

arsenm

lgtm with nit

llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-vgpr16-to-spgr32.ll


broxigarchen requested a review from jayfoad November 28, 2024 22:07

llvmbot added the backend:AMDGPU label Nov 28, 2024

broxigarchen changed the title ~~Support V2S copy with True16 inst format.~~ [AMDGPU][True16][CodeGen]Support V2S copy with True16 inst format. Nov 28, 2024

broxigarchen requested a review from Sisyph November 28, 2024 22:10

Sisyph requested a review from alex-t December 2, 2024 16:04

arsenm reviewed Dec 3, 2024

broxigarchen changed the title ~~[AMDGPU][True16][CodeGen]Support V2S copy with True16 inst format.~~ [AMDGPU][True16][CodeGen]Support V2S copy with True16 flow Dec 5, 2024

arsenm reviewed Dec 6, 2024

llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir

broxigarchen force-pushed the main-merge-true16-fix-sgpr branch from 7e00db4 to bc0eb44 Compare January 13, 2025 22:49

arsenm reviewed Jan 14, 2025

broxigarchen marked this pull request as draft February 20, 2025 20:26

broxigarchen closed this Feb 26, 2025

broxigarchen reopened this Feb 26, 2025

broxigarchen changed the title ~~[AMDGPU][True16][CodeGen]Support V2S copy with True16 flow~~ [AMDGPU][True16][CodeGen] fix-sgpr-copy pass update for 16bit sgpr in true16 flow Feb 26, 2025

broxigarchen force-pushed the main-merge-true16-fix-sgpr branch from bc0eb44 to 3b2e892 Compare February 26, 2025 19:00

broxigarchen closed this May 2, 2025

broxigarchen mentioned this pull request May 2, 2025

[AMDGPU][True16][CodeGen] readfirstlane for vgpr16 copy to sgpr32 #137848

Closed

fix vgpr16 copy to sgpr32

2973ed7

broxigarchen reopened this May 2, 2025

broxigarchen force-pushed the main-merge-true16-fix-sgpr branch from 3b2e892 to 2973ed7 Compare May 2, 2025 16:32

broxigarchen changed the title ~~[AMDGPU][True16][CodeGen] fix-sgpr-copy pass update for 16bit sgpr in true16 flow~~ [AMDGPU][True16][CodeGen] readfirstlane for vgpr16 copy to sgpr32 May 2, 2025

broxigarchen marked this pull request as ready for review May 2, 2025 16:33

broxigarchen requested a review from arsenm May 2, 2025 16:33

broxigarchen requested a review from kosarev May 2, 2025 16:34

Sisyph reviewed May 2, 2025

arsenm reviewed May 5, 2025

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

replace subreg_to_reg with reg_sequence

43a7ff5

arsenm approved these changes May 5, 2025

llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-vgpr16-to-spgr32.ll

update test

7706bea

arsenm approved these changes May 5, 2025

broxigarchen merged commit d4706e1 into llvm:main May 5, 2025
6 of 10 checks passed
