[AMDGPU] Support bottom-up postRA scheduing. #135295
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Solely relying on top‑down scheduling can underutilize hardware, since long‑latency instructions often end up scheduled too late and their latency isn't well hidden. Adding bottom‑up post‑RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.

Full diff: https://github.com/llvm/llvm-project/pull/135295.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index aaefe27b1324f..205cb912126e7 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -284,6 +284,33 @@ void GCNHazardRecognizer::processBundle() {
CurrCycleInstr = nullptr;
}
+void GCNHazardRecognizer::processBundleBottomUp() {
+ // Step through each instruction in the bundle in bottom-up order.
+ MachineBasicBlock::instr_iterator MI =
+ std::next(CurrCycleInstr->getIterator());
+ MachineBasicBlock::instr_iterator E =
+ CurrCycleInstr->getParent()->instr_end();
+
+ // Evict stale entries to maintain a fixed lookahead window.
+ // TODO: Hazard detection is not yet implemented. This scheduling
+ // is intended for GFX11 and newer.
+ for (; MI != E && MI->isInsideBundle(); ++MI) {
+ CurrCycleInstr = &*MI;
+
+ // Remove up to (MaxLookAhead - 1) oldest entries.
+ for (unsigned I = 0, E = MaxLookAhead - 1; I < E && !EmittedInstrs.empty();
+ ++I)
+ EmittedInstrs.pop_back();
+
+ EmittedInstrs.push_back(CurrCycleInstr);
+
+ // Keep only the most recent MaxLookAhead entries
+ EmittedInstrs.resize(MaxLookAhead);
+ }
+
+ CurrCycleInstr = nullptr;
+}
+
void GCNHazardRecognizer::runOnInstruction(MachineInstr *MI) {
assert(IsHazardRecognizerMode);
@@ -423,7 +450,41 @@ void GCNHazardRecognizer::AdvanceCycle() {
}
void GCNHazardRecognizer::RecedeCycle() {
- llvm_unreachable("hazard recognizer does not support bottom-up scheduling.");
+ // If no instruction was issued this cycle, pop the oldest placeholder.
+ if (!CurrCycleInstr) {
+ if (!EmittedInstrs.empty())
+ EmittedInstrs.pop_back();
+ return;
+ }
+
+ // If this is a bundle header, handle the entire bundle here.
+ if (CurrCycleInstr->isBundle()) {
+ processBundleBottomUp();
+ return;
+ }
+
+ unsigned NumWaitStates = TII.getNumWaitStates(*CurrCycleInstr);
+ if (!NumWaitStates) {
+ CurrCycleInstr = nullptr;
+ return;
+ }
+
+ // Add current instruction to the emitted list.
+ EmittedInstrs.push_back(CurrCycleInstr);
+
+ // Model remaining wait states by removing older placeholders.
+ for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
+ ++I) {
+ if (!EmittedInstrs.empty())
+ EmittedInstrs.pop_back();
+ }
+
+ // getMaxLookahead() is the largest number of wait states we will ever need
+ // to insert, so there is no point in keeping track of more than that many
+ // wait states.
+ EmittedInstrs.resize(getMaxLookAhead());
+
+ CurrCycleInstr = nullptr;
}
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index bbc55851bf967..88c7426be552d 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
// Advance over a MachineInstr bundle. Look for hazards in the bundled
// instructions.
void processBundle();
+ // Recede over a MachineInstr bundle. Adds bundled instructions to the
+ // EmittedInstrs queue in bottom-up scheduling mode.
+ // TODO: Hazard detection is not yet implemented.
+ void processBundleBottomUp();
// Run on an individual instruction in hazard recognizer mode. This can be
// used on a newly inserted instruction before returning from PreEmitNoops.
diff --git a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
index 7bdb8f5b35ec5..02ebffca84bda 100644
--- a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
+++ b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
@@ -1,5 +1,6 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=CHECK %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=CHECK-BOTTOMUP %s
--- |
define amdgpu_kernel void @no_sched_barrier(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
@@ -29,6 +30,21 @@ body: |
; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
; CHECK-NEXT: }
; CHECK-NEXT: S_ENDPGM 0
+ ;
+ ; CHECK-BOTTOMUP-LABEL: name: no_sched_barrier
+ ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
renamable $sgpr0_sgpr1 = IMPLICIT_DEF
renamable $vgpr0 = IMPLICIT_DEF
BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -66,6 +82,22 @@ body: |
; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
; CHECK-NEXT: }
; CHECK-NEXT: S_ENDPGM 0
+ ;
+ ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_0
+ ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 0
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
renamable $sgpr0_sgpr1 = IMPLICIT_DEF
renamable $vgpr0 = IMPLICIT_DEF
BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -105,6 +137,22 @@ body: |
; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
; CHECK-NEXT: }
; CHECK-NEXT: S_ENDPGM 0
+ ;
+ ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_1
+ ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 1
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
renamable $sgpr0_sgpr1 = IMPLICIT_DEF
renamable $vgpr0 = IMPLICIT_DEF
BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
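For reference, the bottom-up direction is selected through the existing -misched-postra-direction option, exactly as in the new RUN line above. A standalone invocation against the updated test would look like the following (file path shortened for illustration):

llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false \
    -run-pass=postmisched -misched-postra-direction=bottomup \
    -verify-misched -o - sched-barrier-post-RA.mir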
@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   // Advance over a MachineInstr bundle. Look for hazards in the bundled
   // instructions.
   void processBundle();
+  // Recede over a MachineInstr bundle. Adds bundled instructions to the
+  // EmittedInstrs queue in bottom-up scheduling mode.
+  // TODO: Hazard detection is not yet implemented.
I don't think this is something you can just leave for later. It's a problem if the hazard recognizer doesn't actually deal with the hazards.
Thanks! Should I support all hazard recognizers? I think some older architectures may not need to be supported, since this scheduler is mainly used for graphics. We may only need to support GFX11 and above.
Also, I noticed that some hazard recognizers are specific to older architectures.
Yes
I think the point here is that from GFX11+ there are actually no hazards that are detected as part of MI scheduling. This is a happy accident which makes this code correct for GFX11 onward, but ideally it should still be able to detect hazards to inform the scheduler, in case we need that functionality. I don't think there is an issue with this code not dealing with hazards.

My understanding is that the PostRAHazardRecognizer pass ends up handling most hazards. This always runs top-down using PreEmitNoops and AdvanceCycle.

Would it make sense to add some kind of run-time trap to processBundleBottomUp or RecedeCycle which fails if IsHazardRecognizerMode is set?
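A minimal sketch of such a guard, assuming it is placed at the top of RecedeCycle (an illustration of the suggestion above, not code from this patch):

void GCNHazardRecognizer::RecedeCycle() {
  // The standalone PostRAHazardRecognizer pass always steps top-down via
  // PreEmitNoops/AdvanceCycle, so receding while IsHazardRecognizerMode is
  // set would indicate misuse of this entry point.
  if (IsHazardRecognizerMode)
    llvm_unreachable("bottom-up stepping is not supported in hazard "
                     "recognizer mode");

  // ... bottom-up bookkeeping from this patch ...
}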
Thanks! I noticed that IsHazardRecognizerMode is not set in the post-RA scheduler; it's only enabled in PostRAHazardRecognizer. I also agree that it's not necessary to implement hazard handling in the post-RA scheduler at this point. :-)
"I think the point here is that from GFX11+ there are actually no hazards that are detected as part of MI scheduling."

How did you infer that? Do you mean that none of the hazards that are handled by GCNHazardRecognizer::getHazardType are enabled on GFX11+? I don't really understand the different entrypoints into GCNHazardRecognizer and why they handle different sets of hazards.

But anyway, not handling hazards in the post-RA scheduler should at least be safe, since they will all be handled later by the standalone hazard recognizer pass.
From my perspective the entrypoints are confusing, but essentially they relate to whether GCNHazardRecognizer is running as part of scheduling or standalone.

In GCNHazardRecognizer::getHazardType, the condition if (ST.hasNoDataDepHazard()) is always true on GFX10+, so the function exits early after only the following three checks:

- checkSMRDHazards() only applies to SI, or soft clausing with XNACK (which is not present beyond GFX10.1).
- ST.hasNSAtoVMEMBug() only applies to GFX10.1.
- ST.hasFPAtomicToDenormModeHazard() only applies to GFX10.
// Model remaining wait states by removing older placeholders.
for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
     ++I) {
  if (!EmittedInstrs.empty())
    EmittedInstrs.pop_back();
}
Probably needs more tests stressing the maximum lookahead.
Do you mean adding some lit tests or testing with an application?
Yes, lit tests
Hi @arsenm, when I run the following test:

; RUN: llc -march=amdgcn -mcpu=gfx1100 -debug-only=post-RA-sched -verify-machineinstrs < %s 2>&1 | FileCheck %s
define amdgpu_kernel void @hazard_test() {
entry:
  call void @llvm.amdgcn.s.nop(i16 5)
  call void @llvm.amdgcn.s.nop(i16 0)
  ret void
}
declare void @llvm.amdgcn.s.nop(i16)

it should trigger the maximum lookahead behavior. However, I don't see any log output related to maximum lookahead when using either -misched-postra-direction=topdown or -misched-postra-direction=bottomup.

Could you please give me some suggestions on how to add this test?
ping.