
[AMDGPU] Support bottom-up postRA scheduling. #135295


Open: wants to merge 3 commits into base: main

Conversation

harrisonGPU
Contributor

@harrisonGPU harrisonGPU commented Apr 11, 2025

Solely relying on top‑down scheduling can underutilize hardware, since long‑latency instructions often end up scheduled too late and their latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.
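For reference, the bottom-up mode is exercised through the existing post-RA machine scheduler option; the invocation below mirrors the RUN line added to the lit test in this PR (the other llc options come from that test and may vary by configuration):

```
llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false \
    -run-pass=postmisched -misched-postra-direction=bottomup \
    -verify-misched -o - input.mir
```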

@harrisonGPU harrisonGPU requested a review from arsenm April 11, 2025 01:40

github-actions bot commented Apr 11, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@harrisonGPU harrisonGPU marked this pull request as ready for review April 17, 2025 03:30
@llvmbot
Member

llvmbot commented Apr 17, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Solely relying on top‑down scheduling can underutilize hardware, since long‑latency instructions often end up scheduled too late and their latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.


Full diff: https://github.com/llvm/llvm-project/pull/135295.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp (+62-1)
  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir (+49-1)
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index aaefe27b1324f..205cb912126e7 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -284,6 +284,33 @@ void GCNHazardRecognizer::processBundle() {
   CurrCycleInstr = nullptr;
 }
 
+void GCNHazardRecognizer::processBundleBottomUp() {
+  // Step through each instruction in the bundle in bottom-up order.
+  MachineBasicBlock::instr_iterator MI =
+      std::next(CurrCycleInstr->getIterator());
+  MachineBasicBlock::instr_iterator E =
+      CurrCycleInstr->getParent()->instr_end();
+
+  // Evict stale entries to maintain a fixed lookahead window.
+  // TODO: Hazard detection is not yet implemented. This scheduling
+  // is intended for GFX11 and newer.
+  for (; MI != E && MI->isInsideBundle(); ++MI) {
+    CurrCycleInstr = &*MI;
+
+    // Remove up to (MaxLookAhead - 1) oldest entries.
+    for (unsigned I = 0, E = MaxLookAhead - 1; I < E && !EmittedInstrs.empty();
+         ++I)
+      EmittedInstrs.pop_back();
+
+    EmittedInstrs.push_back(CurrCycleInstr);
+
+    // Keep only the most recent MaxLookAhead entries
+    EmittedInstrs.resize(MaxLookAhead);
+  }
+
+  CurrCycleInstr = nullptr;
+}
+
 void GCNHazardRecognizer::runOnInstruction(MachineInstr *MI) {
   assert(IsHazardRecognizerMode);
 
@@ -423,7 +450,41 @@ void GCNHazardRecognizer::AdvanceCycle() {
 }
 
 void GCNHazardRecognizer::RecedeCycle() {
-  llvm_unreachable("hazard recognizer does not support bottom-up scheduling.");
+  // If no instruction was issued this cycle, pop the oldest placeholder.
+  if (!CurrCycleInstr) {
+    if (!EmittedInstrs.empty())
+      EmittedInstrs.pop_back();
+    return;
+  }
+
+  // If this is a bundle header, handle the entire bundle here.
+  if (CurrCycleInstr->isBundle()) {
+    processBundleBottomUp();
+    return;
+  }
+
+  unsigned NumWaitStates = TII.getNumWaitStates(*CurrCycleInstr);
+  if (!NumWaitStates) {
+    CurrCycleInstr = nullptr;
+    return;
+  }
+
+  // Add current instruction to the emitted list.
+  EmittedInstrs.push_back(CurrCycleInstr);
+
+  // Model remaining wait states by removing older placeholders.
+  for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
+       ++I) {
+    if (!EmittedInstrs.empty())
+      EmittedInstrs.pop_back();
+  }
+
+  // getMaxLookahead() is the largest number of wait states we will ever need
+  // to insert, so there is no point in keeping track of more than that many
+  // wait states.
+  EmittedInstrs.resize(getMaxLookAhead());
+
+  CurrCycleInstr = nullptr;
 }
 
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index bbc55851bf967..88c7426be552d 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   // Advance over a MachineInstr bundle. Look for hazards in the bundled
   // instructions.
   void processBundle();
+  // Recede over a MachineInstr bundle. Adds bundled instructions to the
+  // EmittedInstrs queue in bottom-up scheduling mode.
+  // TODO: Hazard detection is not yet implemented.
+  void processBundleBottomUp();
 
   // Run on an individual instruction in hazard recognizer mode. This can be
   // used on a newly inserted instruction before returning from PreEmitNoops.
diff --git a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
index 7bdb8f5b35ec5..02ebffca84bda 100644
--- a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
+++ b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
@@ -1,5 +1,6 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=CHECK %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=CHECK-BOTTOMUP %s
 
 --- |
   define amdgpu_kernel void @no_sched_barrier(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
@@ -29,6 +30,21 @@ body: |
     ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
     ; CHECK-NEXT: }
     ; CHECK-NEXT: S_ENDPGM 0
+    ;
+    ; CHECK-BOTTOMUP-LABEL: name: no_sched_barrier
+    ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
     renamable $sgpr0_sgpr1 = IMPLICIT_DEF
     renamable $vgpr0 = IMPLICIT_DEF
     BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -66,6 +82,22 @@ body: |
     ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
     ; CHECK-NEXT: }
     ; CHECK-NEXT: S_ENDPGM 0
+    ;
+    ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_0
+    ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 0
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
     renamable $sgpr0_sgpr1 = IMPLICIT_DEF
     renamable $vgpr0 = IMPLICIT_DEF
     BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -105,6 +137,22 @@ body: |
     ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
     ; CHECK-NEXT: }
     ; CHECK-NEXT: S_ENDPGM 0
+    ;
+    ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_1
+    ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 1
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
     renamable $sgpr0_sgpr1 = IMPLICIT_DEF
     renamable $vgpr0 = IMPLICIT_DEF
     BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {

@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
// Advance over a MachineInstr bundle. Look for hazards in the bundled
// instructions.
void processBundle();
// Recede over a MachineInstr bundle. Adds bundled instructions to the
// EmittedInstrs queue in bottom-up scheduling mode.
// TODO: Hazard detection is not yet implemented.
Contributor

I don't think this is something you can just leave for later. It's a problem if the hazard recognizer doesn't actually deal with the hazards

Contributor Author

Thanks! Should I support all hazard recognizers? I think some older architectures may not need to be supported, since this scheduler is mainly used for graphics. We may only need to support GFX11 and above.
Also, I noticed that some hazard recognizers are specific to older architectures.

Contributor

Yes

Contributor

I think the point here is that from GFX11+ there are actually no hazards that are detected as part of MI scheduling.
This is a happy accident which makes this code correct for GFX11 onward, but ideally it should still be able to detect hazards to inform the scheduler, in case we need that functionality.

I don't think there is an issue with this code not dealing with hazards.
My understanding is that the PostRAHazardRecognizer pass ends up handling most hazards.
This always runs top-down using PreEmitNoops and AdvanceCycle.
Would it make sense to add some kind of run-time trap to processBundleBottomUp or RecedeCycle that fails if IsHazardRecognizerMode is set?

Contributor Author

@harrisonGPU harrisonGPU Apr 20, 2025

Thanks! I noticed that IsHazardRecognizerMode is not set in the post-RA scheduler; it’s only enabled in PostRAHazardRecognizer. I also agree that it’s not necessary to implement hazard handling in the post-RA scheduler at this point. :-)

Contributor

I think the point here is that from GFX11+ there are actually no hazards that are detected as part of MI scheduling.

How did you infer that? Do you mean that none of the hazards that are handled by GCNHazardRecognizer::getHazardType are enabled on GFX11+? I don't really understand the different entrypoints into GCNHazardRecognizer and why they handle different sets of hazards.

But anyway not handling hazards in the post-RA scheduler should at least be safe, since they will all be handled later by the standalone hazard recognizer pass.

Contributor

From my perspective the entrypoints are confusing, but essentially they relate to whether GCNHazardRecognizer is running as part of scheduling or standalone.

In GCNHazardRecognizer::getHazardType the condition if (ST.hasNoDataDepHazard()) is always true on GFX10+, so the function exits early after only the following three checks:

  • checkSMRDHazards() only applies to SI, or soft clausing with XNACK (which is not present beyond GFX10.1).
  • ST.hasNSAtoVMEMBug() only applies to GFX10.1.
  • ST.hasFPAtomicToDenormModeHazard() only applies to GFX10.

Comment on lines +475 to +480
// Model remaining wait states by removing older placeholders.
for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
++I) {
if (!EmittedInstrs.empty())
EmittedInstrs.pop_back();
}
Contributor

Probably needs more tests stressing the maximum lookahead

Contributor Author

Do you mean adding some lit tests or testing with an application?

Contributor

Yes, lit tests

Contributor Author

Hi @arsenm, when I run the following test:

; RUN: llc -march=amdgcn -mcpu=gfx1100 -debug-only=post-RA-sched -verify-machineinstrs < %s 2>&1 | FileCheck %s

define amdgpu_kernel void @hazard_test() {
entry:
  call void @llvm.amdgcn.s.nop(i16 5)
  call void @llvm.amdgcn.s.nop(i16 0)
  ret void
}

declare void @llvm.amdgcn.s.nop(i16)

It should trigger the maximum lookahead behavior. However, I don't see any log output related to maximum lookahead when using either -misched-postra-direction=topdown or -misched-postra-direction=bottomup.

Could you please give me some suggestions on how to add this test?

@harrisonGPU harrisonGPU requested review from arsenm and shiltian April 25, 2025 09:18
@harrisonGPU
Contributor Author

ping.
