[AMDGPU][True16][CodeGen] readfirstlane for vgpr16 copy to sgpr32 #118037

Merged
merged 3 commits into from
May 5, 2025

Conversation

broxigarchen
Contributor

@broxigarchen broxigarchen commented Nov 28, 2024

In true16 mode, isel lowering can select an i16 value into either sgpr32 or vgpr16. As a result, ext selection can produce copies from vgpr16 to sgpr32, which seems unavoidable without sgpr16 support.

Legalize the src/dst registers when the fix-sgpr-copy pass decides to lower this special copy to a readfirstlane, and add a lit test.

@broxigarchen broxigarchen requested a review from jayfoad November 28, 2024 22:07
@broxigarchen broxigarchen changed the title Support V2S copy with True16 inst format. [AMDGPU][True16][CodeGen]Support V2S copy with True16 inst format. Nov 28, 2024
@llvmbot
Member

llvmbot commented Nov 28, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Brox Chen (broxigarchen)

Changes

V2S COPY can be emitted as either

sgpr_32 = COPY vgpr_16
or
sgpr_lo16 = COPY vgpr_16

For a 16-bit source, emit a REG_SEQUENCE with the hi16 bits undef and use it as the readfirstlane operand.
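As a rough illustration of what the lowered sequence computes, here is a minimal sketch in plain C++ with no LLVM dependency. It is a toy model, not the pass itself: the four-lane wave and the names `regSequence`, `readFirstLane`, and `copyVgpr16ToSgpr32` are hypothetical and chosen only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model only: a "wave" of four lanes (real waves have 32 or 64 lanes).
constexpr std::size_t kLanes = 4;
using Vgpr16 = std::array<std::uint16_t, kLanes>;
using Vgpr32 = std::array<std::uint32_t, kLanes>;

// Models REG_SEQUENCE %src, lo16, %undef, hi16: in each lane the 16-bit
// source becomes the low half and an arbitrary value becomes the high half.
Vgpr32 regSequence(const Vgpr16 &lo, const Vgpr16 &hi) {
  Vgpr32 out{};
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    out[lane] = (static_cast<std::uint32_t>(hi[lane]) << 16) | lo[lane];
  return out;
}

// Models V_READFIRSTLANE_B32: return the value of the lowest-numbered lane
// whose bit is set in exec. This model falls back to lane 0 if exec is 0.
std::uint32_t readFirstLane(const Vgpr32 &v, std::uint32_t exec) {
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    if (exec & (1u << lane))
      return v[lane];
  return v[0];
}

// The whole lowered copy: widen to 32 bits, then read the first active lane.
std::uint32_t copyVgpr16ToSgpr32(const Vgpr16 &src, const Vgpr16 &undefHi,
                                 std::uint32_t exec) {
  return readFirstLane(regSequence(src, undefHi), exec);
}
```

In this model only the low 16 bits of the result are meaningful, since the high half comes from the undef operand. The lit test in this PR follows the readfirstlane with an `S_AND_B32` against 255, so the high half never reaches the comparison there.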


Full diff: https://github.com/llvm/llvm-project/pull/118037.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp (+19-4)
  • (added) llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir (+118)
diff --git a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
index ac69bf6d038ece..9749d09592bab6 100644
--- a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
@@ -1075,10 +1075,25 @@ void SIFixSGPRCopies::lowerVGPR2SGPRCopies(MachineFunction &MF) {
         TRI->getRegClassForOperandReg(*MRI, MI->getOperand(1));
     size_t SrcSize = TRI->getRegSizeInBits(*SrcRC);
     if (SrcSize == 16) {
-      // HACK to handle possible 16bit VGPR source
-      auto MIB = BuildMI(*MBB, MI, MI->getDebugLoc(),
-                         TII->get(AMDGPU::V_READFIRSTLANE_B32), DstReg);
-      MIB.addReg(SrcReg, 0, AMDGPU::NoSubRegister);
+      assert(MF.getSubtarget<GCNSubtarget>().useRealTrue16Insts() &&
+             "We do not expect to see 16-bit copies from VGPR to SGPR unless "
+             "we have 16-bit VGPRs");
+      assert(MRI->getRegClass(DstReg) == &AMDGPU::SGPR_LO16RegClass ||
+             MRI->getRegClass(DstReg) == &AMDGPU::SReg_32RegClass);
+      // There is no V_READFIRSTLANE_B16, so widen the destination scalar
+      // value to 32 bits
+      MRI->setRegClass(DstReg, &AMDGPU::SGPR_32RegClass);
+      Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+      const DebugLoc &DL = MI->getDebugLoc();
+      Register Undef = MRI->createVirtualRegister(&AMDGPU::VGPR_16RegClass);
+      BuildMI(*MBB, MI, DL, TII->get(AMDGPU::IMPLICIT_DEF), Undef);
+      BuildMI(*MBB, MI, DL, TII->get(AMDGPU::REG_SEQUENCE), TmpReg)
+          .addReg(SrcReg, 0, SubReg)
+          .addImm(AMDGPU::lo16)
+          .addReg(Undef)
+          .addImm(AMDGPU::hi16);
+      BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32), DstReg)
+          .addReg(TmpReg);
     } else if (SrcSize == 32) {
       auto MIB = BuildMI(*MBB, MI, MI->getDebugLoc(),
                          TII->get(AMDGPU::V_READFIRSTLANE_B32), DstReg);
diff --git a/llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir b/llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir
new file mode 100644
index 00000000000000..640245b53b5c0a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/true16-copy-vgpr16-to-sgpr32.mir
@@ -0,0 +1,118 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 3
+# RUN: llc -march=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -run-pass=si-fix-sgpr-copies -verify-machineinstrs -o - %s | FileCheck %s
+
+# Ensure READFIRSTLANE is generated, and that its src is REG_SEQUENCE.
+
+---
+name:            test4
+tracksRegLiveness: true
+body:             |
+  ; CHECK-LABEL: name: test4
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:sgpr_128 = COPY undef %1:sgpr_128
+  ; CHECK-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:sgpr_128 = COPY undef %4:sgpr_128
+  ; CHECK-NEXT:   S_BRANCH %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[S_MOV_B32_]], %bb.0, %6, %bb.3
+  ; CHECK-NEXT:   [[PHI1:%[0-9]+]]:sreg_32 = PHI [[S_MOV_B32_]], %bb.0, %8, %bb.3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.3(0x40000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY [[PHI]]
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[PHI]]
+  ; CHECK-NEXT:   [[BUFFER_LOAD_USHORT_OFFEN:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN [[COPY3]], [[COPY]], [[S_MOV_B32_1]], 0, 0, 0, implicit $exec
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[PHI]]
+  ; CHECK-NEXT:   [[BUFFER_LOAD_USHORT_OFFEN1:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN [[COPY4]], [[COPY]], [[S_MOV_B32_1]], 2, 0, 0, implicit $exec
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:vgpr_16 = COPY [[BUFFER_LOAD_USHORT_OFFEN]].lo16
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:vgpr_16 = COPY [[BUFFER_LOAD_USHORT_OFFEN1]].lo16
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
+  ; CHECK-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 65535, implicit $exec
+  ; CHECK-NEXT:   [[V_AND_B32_e64_:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 killed [[V_MOV_B32_e32_]], [[COPY7]], implicit $exec
+  ; CHECK-NEXT:   [[V_LSHL_OR_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHL_OR_B32_e64 [[COPY8]], 16, killed [[V_AND_B32_e64_]], implicit $exec
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:sgpr_lo16 = COPY [[PHI1]].lo16
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:vgpr_16 = COPY [[COPY9]]
+  ; CHECK-NEXT:   [[V_SUB_NC_U16_t16_e64_:%[0-9]+]]:vgpr_16 = V_SUB_NC_U16_t16_e64 0, [[COPY10]], 0, killed [[COPY5]], 0, 0, implicit $exec
+  ; CHECK-NEXT:   [[DEF:%[0-9]+]]:vgpr_16 = IMPLICIT_DEF
+  ; CHECK-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vgpr_32 = REG_SEQUENCE [[V_SUB_NC_U16_t16_e64_]], %subreg.lo16, [[DEF]], %subreg.hi16
+  ; CHECK-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]], implicit $exec
+  ; CHECK-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 255
+  ; CHECK-NEXT:   [[S_AND_B32_:%[0-9]+]]:sreg_32 = S_AND_B32 killed [[V_READFIRSTLANE_B32_]], killed [[S_MOV_B32_2]], implicit-def dead $scc
+  ; CHECK-NEXT:   [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 12
+  ; CHECK-NEXT:   S_CMP_LT_I32 [[S_AND_B32_]], killed [[S_MOV_B32_3]], implicit-def $scc
+  ; CHECK-NEXT:   S_CBRANCH_SCC1 %bb.4, implicit $scc
+  ; CHECK-NEXT:   S_BRANCH %bb.3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.1(0x40000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[S_MOV_B32_4:%[0-9]+]]:sreg_32 = S_MOV_B32 -1
+  ; CHECK-NEXT:   [[DEF1:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+  ; CHECK-NEXT:   [[S_MOV_B32_5:%[0-9]+]]:sreg_32 = S_MOV_B32 18
+  ; CHECK-NEXT:   S_CMP_LT_I32 [[S_AND_B32_]], killed [[S_MOV_B32_5]], implicit-def $scc
+  ; CHECK-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit $scc
+  ; CHECK-NEXT:   S_BRANCH %bb.4
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   S_ENDPGM 0
+  bb.1:
+    successors: %bb.3(0x80000000); %bb.3(100.00%)
+
+    %1:sgpr_128 = COPY undef %150:sgpr_128
+    %131:sreg_32 = S_MOV_B32 0
+    %2:sgpr_128 = COPY undef %151:sgpr_128
+    S_BRANCH %bb.3
+
+  bb.3:
+    successors: %bb.4(0x80000000); %bb.4(100.00%)
+
+    %3:sreg_32 = PHI %131:sreg_32, %bb.1, %183, %bb.5
+    %4:sreg_32 = PHI %131:sreg_32, %bb.1, %182, %bb.5
+
+  bb.4:
+    successors: %bb.6(0x40000000), %bb.5(0x40000000); %bb.5(50.00%), %bb.6(50.00%)
+
+    %154:sreg_32 = S_MOV_B32 0
+    %156:vgpr_32 = COPY %3:sreg_32
+    %162:vgpr_32 = COPY %3:sreg_32
+    %161:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN %162:vgpr_32, %1:sgpr_128, %154:sreg_32, 0, 0, 0, implicit $exec
+    %164:vgpr_32 = COPY %3:sreg_32
+    %163:vgpr_32 = BUFFER_LOAD_USHORT_OFFEN %164:vgpr_32, %1:sgpr_128, %154:sreg_32, 2, 0, 0, implicit $exec
+    %9:vgpr_16 = COPY %161.lo16:vgpr_32
+    %10:vgpr_16 = COPY %163.lo16:vgpr_32
+    %165:sreg_32 = COPY %9:vgpr_16
+    %166:sreg_32 = COPY %10:vgpr_16
+    %12:sreg_32 = S_PACK_LL_B32_B16 %165:sreg_32, %166:sreg_32
+    %167:sgpr_lo16 = COPY %4.lo16:sreg_32
+    %170:vgpr_16 = COPY %167:sgpr_lo16
+    %177:vgpr_16 = V_SUB_NC_U16_t16_e64 0, %170:vgpr_16, 0, killed %9:vgpr_16, 0, 0, implicit $exec
+    %179:sreg_32 = COPY killed %177:vgpr_16
+    %180:sreg_32 = S_MOV_B32 255
+    %13:sreg_32 = S_AND_B32 killed %179:sreg_32, killed %180:sreg_32, implicit-def dead $scc
+    %181:sreg_32 = S_MOV_B32 12
+    S_CMP_LT_I32 %13:sreg_32, killed %181:sreg_32, implicit-def $scc
+    S_CBRANCH_SCC1 %bb.6, implicit $scc
+    S_BRANCH %bb.5
+
+  bb.5:
+    successors: %bb.6(0x40000000), %bb.3(0x40000000); %bb.6(50.00%), %bb.3(50.00%)
+
+    %183:sreg_32 = S_MOV_B32 -1
+    %182:sreg_32 = IMPLICIT_DEF
+    %184:sreg_32 = S_MOV_B32 18
+    S_CMP_LT_I32 %13:sreg_32, killed %184:sreg_32, implicit-def $scc
+    S_CBRANCH_SCC1 %bb.3, implicit $scc
+    S_BRANCH %bb.6
+
+  bb.6:
+      S_ENDPGM 0
+
+...

@broxigarchen broxigarchen requested a review from Sisyph November 28, 2024 22:10
@Sisyph Sisyph requested a review from alex-t December 2, 2024 16:04
@@ -1075,10 +1075,25 @@ void SIFixSGPRCopies::lowerVGPR2SGPRCopies(MachineFunction &MF) {
TRI->getRegClassForOperandReg(*MRI, MI->getOperand(1));
size_t SrcSize = TRI->getRegSizeInBits(*SrcRC);
if (SrcSize == 16) {
Contributor

Are you reaching here in practice?

Contributor Author

@broxigarchen broxigarchen Jan 13, 2025

Hi Matt. I think we probably never hit this with the previous code, but with the true16 flow we should hit this path.

Contributor

Not exactly, we don't want SIFixSGPRCopies to need to handle anything

Contributor Author

Hi Matt. Could you elaborate a bit here? Do you think we should not see this 16-bit handling in this pass? Thanks!

Contributor

It does need to be handled here, but if you're seeing this in practice something else is likely missing

Contributor

This code fixed the conformance test I was looking at, and as you say it should be handled here, so I think it should be merged.
Regarding an 'alternative fix', do you have any hypothesis what that would be? Is it a priority to find that?

Contributor

It depends what it is. Either way that indicates we should have an end to end IR test for whatever this issue was as well

Contributor Author

Updated this PR's description to answer this question.

@broxigarchen broxigarchen changed the title [AMDGPU][True16][CodeGen]Support V2S copy with True16 inst format. [AMDGPU][True16][CodeGen]Support V2S copy with True16 flow Dec 5, 2024
@broxigarchen broxigarchen force-pushed the main-merge-true16-fix-sgpr branch from 7e00db4 to bc0eb44 Compare January 13, 2025 22:49
@broxigarchen broxigarchen marked this pull request as draft February 20, 2025 20:26
@broxigarchen broxigarchen reopened this Feb 26, 2025
@broxigarchen broxigarchen changed the title [AMDGPU][True16][CodeGen]Support V2S copy with True16 flow [AMDGPU][True16][CodeGen] fix-sgpr-copy pass update for 16bit sgpr in true16 flow Feb 26, 2025
@broxigarchen broxigarchen force-pushed the main-merge-true16-fix-sgpr branch from bc0eb44 to 3b2e892 Compare February 26, 2025 19:00
@broxigarchen broxigarchen reopened this May 2, 2025
@broxigarchen broxigarchen force-pushed the main-merge-true16-fix-sgpr branch from 3b2e892 to 2973ed7 Compare May 2, 2025 16:32
@broxigarchen broxigarchen changed the title [AMDGPU][True16][CodeGen] fix-sgpr-copy pass update for 16bit sgpr in true16 flow [AMDGPU][True16][CodeGen] readfirstlane for vgpr16 copy to sgpr32 May 2, 2025
@broxigarchen broxigarchen marked this pull request as ready for review May 2, 2025 16:33
@broxigarchen broxigarchen requested a review from arsenm May 2, 2025 16:33
@broxigarchen
Contributor Author

After investigation, we still need this patch for true16 mode. I have updated and trimmed the patch and updated the lit test for review.

@broxigarchen broxigarchen requested a review from kosarev May 2, 2025 16:34
Contributor

@Sisyph Sisyph left a comment

Thanks for adding the IR test, it helps a lot. I would describe the situation as moving a uniform 16 bit value in a VGPR into an SGPR, when we don't have a readfirstlane_b16 instruction or a 16 bit SGPR. LGTM, but please wait for @arsenm.
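The word "uniform" in the comment above matters because reading the first active lane is only a faithful copy when all active lanes agree. The following is a minimal sketch of that condition in plain C++, assuming a toy four-lane wave; `isUniform` and `firstActiveLane` are hypothetical helpers for this example, not LLVM or hardware APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model only: four lanes, each holding a 16-bit VGPR value.
constexpr std::size_t kLanes = 4;
using Vgpr16 = std::array<std::uint16_t, kLanes>;

// True when every lane enabled in exec holds the same 16-bit value.
bool isUniform(const Vgpr16 &v, std::uint32_t exec) {
  bool seen = false;
  std::uint16_t first = 0;
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    if (!(exec & (1u << lane)))
      continue; // inactive lanes do not participate
    if (!seen) {
      first = v[lane];
      seen = true;
    } else if (v[lane] != first) {
      return false;
    }
  }
  return true;
}

// Value of the lowest-numbered active lane (lane 0 if none, in this model).
std::uint16_t firstActiveLane(const Vgpr16 &v, std::uint32_t exec) {
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    if (exec & (1u << lane))
      return v[lane];
  return v[0];
}
```

When `isUniform` holds, `firstActiveLane` returns the same value whichever subset of those lanes is active, so a single scalar register can stand in for the vector value. When it does not hold, the result depends on which lane happens to be first, and the other lanes' values are lost.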

Contributor

@arsenm arsenm left a comment

lgtm with nit

@broxigarchen broxigarchen merged commit d4706e1 into llvm:main May 5, 2025
6 of 10 checks passed
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
…vm#118037)

i16 can be selected into sgpr32 or vgpr16 in isel lowering in true16
mode. And thus, it creates cases that we copy from vgpr16 to sgpr32 in
ext selection and this seems inevitable without sgpr16 support.

legalize the src/dst reg when we decide to lower this special copy to a
readfirstlane in fix-sgpr-copy pass and add a lit test