
[AMDGPU] Support bottom-up postRA scheduling. #135295


Open: wants to merge 3 commits into base: main

Conversation

harrisonGPU
Contributor

@harrisonGPU harrisonGPU commented Apr 11, 2025

Solely relying on top‑down scheduling can underutilize hardware, since long‑latency instructions often end up scheduled too late and their latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.
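For reference, the bottom-up mode is exercised through the existing post-RA machine scheduler option; the invocation below mirrors the RUN line added to the lit test in this PR (the other llc options come from that test and may vary by configuration):

```
llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false \
    -run-pass=postmisched -misched-postra-direction=bottomup \
    -verify-misched -o - input.mir
```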

@harrisonGPU harrisonGPU requested a review from arsenm April 11, 2025 01:40

github-actions bot commented Apr 11, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@harrisonGPU harrisonGPU marked this pull request as ready for review April 17, 2025 03:30
@llvmbot
Member

llvmbot commented Apr 17, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Solely relying on top‑down scheduling can underutilize hardware, since long‑latency instructions often end up scheduled too late and their latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.


Full diff: https://github.com/llvm/llvm-project/pull/135295.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp (+62-1)
  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir (+49-1)
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index aaefe27b1324f..205cb912126e7 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -284,6 +284,33 @@ void GCNHazardRecognizer::processBundle() {
   CurrCycleInstr = nullptr;
 }
 
+void GCNHazardRecognizer::processBundleBottomUp() {
+  // Step through each instruction in the bundle in bottom-up order.
+  MachineBasicBlock::instr_iterator MI =
+      std::next(CurrCycleInstr->getIterator());
+  MachineBasicBlock::instr_iterator E =
+      CurrCycleInstr->getParent()->instr_end();
+
+  // Evict stale entries to maintain a fixed lookahead window.
+  // TODO: Hazard detection is not yet implemented. This scheduling
+  // is intended for GFX11 and newer.
+  for (; MI != E && MI->isInsideBundle(); ++MI) {
+    CurrCycleInstr = &*MI;
+
+    // Remove up to (MaxLookAhead - 1) oldest entries.
+    for (unsigned I = 0, E = MaxLookAhead - 1; I < E && !EmittedInstrs.empty();
+         ++I)
+      EmittedInstrs.pop_back();
+
+    EmittedInstrs.push_back(CurrCycleInstr);
+
+    // Keep only the most recent MaxLookAhead entries
+    EmittedInstrs.resize(MaxLookAhead);
+  }
+
+  CurrCycleInstr = nullptr;
+}
+
 void GCNHazardRecognizer::runOnInstruction(MachineInstr *MI) {
   assert(IsHazardRecognizerMode);
 
@@ -423,7 +450,41 @@ void GCNHazardRecognizer::AdvanceCycle() {
 }
 
 void GCNHazardRecognizer::RecedeCycle() {
-  llvm_unreachable("hazard recognizer does not support bottom-up scheduling.");
+  // If no instruction was issued this cycle, pop the oldest placeholder.
+  if (!CurrCycleInstr) {
+    if (!EmittedInstrs.empty())
+      EmittedInstrs.pop_back();
+    return;
+  }
+
+  // If this is a bundle header, handle the entire bundle here.
+  if (CurrCycleInstr->isBundle()) {
+    processBundleBottomUp();
+    return;
+  }
+
+  unsigned NumWaitStates = TII.getNumWaitStates(*CurrCycleInstr);
+  if (!NumWaitStates) {
+    CurrCycleInstr = nullptr;
+    return;
+  }
+
+  // Add current instruction to the emitted list.
+  EmittedInstrs.push_back(CurrCycleInstr);
+
+  // Model remaining wait states by removing older placeholders.
+  for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
+       ++I) {
+    if (!EmittedInstrs.empty())
+      EmittedInstrs.pop_back();
+  }
+
+  // getMaxLookahead() is the largest number of wait states we will ever need
+  // to insert, so there is no point in keeping track of more than that many
+  // wait states.
+  EmittedInstrs.resize(getMaxLookAhead());
+
+  CurrCycleInstr = nullptr;
 }
 
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index bbc55851bf967..88c7426be552d 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   // Advance over a MachineInstr bundle. Look for hazards in the bundled
   // instructions.
   void processBundle();
+  // Recede over a MachineInstr bundle. Adds bundled instructions to the
+  // EmittedInstrs queue in bottom-up scheduling mode.
+  // TODO: Hazard detection is not yet implemented.
+  void processBundleBottomUp();
 
   // Run on an individual instruction in hazard recognizer mode. This can be
   // used on a newly inserted instruction before returning from PreEmitNoops.
diff --git a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
index 7bdb8f5b35ec5..02ebffca84bda 100644
--- a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
+++ b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
@@ -1,5 +1,6 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=CHECK %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=CHECK-BOTTOMUP %s
 
 --- |
   define amdgpu_kernel void @no_sched_barrier(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
@@ -29,6 +30,21 @@ body: |
     ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
     ; CHECK-NEXT: }
     ; CHECK-NEXT: S_ENDPGM 0
+    ;
+    ; CHECK-BOTTOMUP-LABEL: name: no_sched_barrier
+    ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
     renamable $sgpr0_sgpr1 = IMPLICIT_DEF
     renamable $vgpr0 = IMPLICIT_DEF
     BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -66,6 +82,22 @@ body: |
     ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
     ; CHECK-NEXT: }
     ; CHECK-NEXT: S_ENDPGM 0
+    ;
+    ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_0
+    ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 0
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
     renamable $sgpr0_sgpr1 = IMPLICIT_DEF
     renamable $vgpr0 = IMPLICIT_DEF
     BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -105,6 +137,22 @@ body: |
     ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
     ; CHECK-NEXT: }
     ; CHECK-NEXT: S_ENDPGM 0
+    ;
+    ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_1
+    ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 1
+    ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-BOTTOMUP-NEXT: }
+    ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
     renamable $sgpr0_sgpr1 = IMPLICIT_DEF
     renamable $vgpr0 = IMPLICIT_DEF
     BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {

@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
// Advance over a MachineInstr bundle. Look for hazards in the bundled
// instructions.
void processBundle();
// Recede over a MachineInstr bundle. Adds bundled instructions to the
// EmittedInstrs queue in bottom-up scheduling mode.
// TODO: Hazard detection is not yet implemented.
Contributor

I don't think this is something you can just leave for later. It's a problem if the hazard recognizer doesn't actually deal with the hazards

Contributor Author

Thanks! Should I support all hazard recognizers? I think some older architectures may not need to be supported, since this scheduler is mainly used for graphics. We may only need to support GFX11 and above.
Also, I noticed that some hazard recognizers are specific to older architectures.

Contributor

Yes

Contributor

I think the point here is that from GFX11+ there are actually no hazards that are detected as part of MI scheduling.
This is a happy accident which makes this code correct for GFX11 onward, but ideally it should still be able to detect hazards to inform the scheduler, in case we need that functionality.

I don't think there is an issue with this code not dealing with hazards.
My understanding is that the PostRAHazardRecognizer pass ends up handling most hazards.
This always runs top-down using PreEmitNoops and AdvanceCycle.
Would it make sense to add some kind of run-time trap to processBundleBottomUp or RecedeCycle that fails if IsHazardRecognizerMode is set?

Contributor Author

@harrisonGPU harrisonGPU Apr 20, 2025

Thanks! I noticed that IsHazardRecognizerMode is not set in the post-RA scheduler; it’s only enabled in PostRAHazardRecognizer. I also agree that it’s not necessary to implement hazard handling in the post-RA scheduler at this point. :-)

Contributor

I think the point here is that from GFX11+ there are actually no hazards that are detected as part of MI scheduling.

How did you infer that? Do you mean that none of the hazards that are handled by GCNHazardRecognizer::getHazardType are enabled on GFX11+? I don't really understand the different entrypoints into GCNHazardRecognizer and why they handle different sets of hazards.

But anyway not handling hazards in the post-RA scheduler should at least be safe, since they will all be handled later by the standalone hazard recognizer pass.

Contributor

From my perspective the entrypoints are confusing, but essentially they relate to whether GCNHazardRecognizer is running as part of scheduling or standalone.

In GCNHazardRecognizer::getHazardType the condition if (ST.hasNoDataDepHazard()) is always true on GFX10+, so the function exits early after only the following three checks:

  • checkSMRDHazards() only applies to SI, or soft clausing with XNACK (which is not present beyond GFX10.1).
  • ST.hasNSAtoVMEMBug() only applies to GFX10.1.
  • ST.hasFPAtomicToDenormModeHazard() only applies to GFX10.

Comment on lines +475 to +480
// Model remaining wait states by removing older placeholders.
for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
++I) {
if (!EmittedInstrs.empty())
EmittedInstrs.pop_back();
}
Contributor

Probably needs more tests stressing the maximum lookahead

Contributor Author

Do you mean adding some lit tests or testing with an application?

Contributor

Yes, lit tests

Contributor Author

Hi @arsenm, when I run the following test:

; RUN: llc -march=amdgcn -mcpu=gfx1100 -debug-only=post-RA-sched -verify-machineinstrs < %s 2>&1 | FileCheck %s

define amdgpu_kernel void @hazard_test() {
entry:
  call void @llvm.amdgcn.s.nop(i16 5)
  call void @llvm.amdgcn.s.nop(i16 0)
  ret void
}

declare void @llvm.amdgcn.s.nop(i16)

It should trigger the maximum lookahead behavior. However, I don't see any log output related to maximum lookahead when using either -misched-postra-direction=topdown or -misched-postra-direction=bottomup.

Could you please give me some suggestions on how to add this test?

@harrisonGPU harrisonGPU requested review from arsenm and shiltian April 25, 2025 09:18
@harrisonGPU
Contributor Author

ping.
