[RISCV] Add scheduler definitions for SpacemiT-X60 #137343
@llvm/pr-subscribers-backend-risc-v

Author: Mikhail R. Gadelha (mikhailramalho)

Changes

This patch adds an initial scheduler model for the SpacemiT-X60, including latency for scalar instructions only.

The scheduler is based on the documented characteristics of the C908, which the SpacemiT-X60 is believed to be based on, and provides the expected latency for several instructions. I ran llvm-exegesis to confirm most of these values and to get the latency of instructions not provided by the C908 documentation (e.g., double floating-point instructions).

For load and store instructions, the C908 documentation says the latency is >= 3 for loads and 1 for stores. I tried a few combinations of values until I got the current values of 5 and 3, which yield the best results.

Although the X60 does appear to support multiple issue for at least some floating-point instructions, this model assumes single issue, since increasing it reduces the gains reported below.

This patch gives a geomean improvement of ~4% on SPEC CPU 2017 for both rva22u64 and rva22u64_v, with some benchmarks improving by up to 15% (525.x264_r, 508.namd_r). No execution-time regressions were detected.

This initial scheduling model is strongly focused on providing sufficient definitions to deliver improved performance for the SpacemiT-X60. Further incremental gains may be possible through a much more detailed microarchitectural analysis, but that is left to future work. Scheduling definitions for RVV can be added in a future PR.

Patch is 68.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137343.diff

7 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCV.td b/llvm/lib/Target/RISCV/RISCV.td
index 2c2271e486a84..6a6cec88b74a4 100644
--- a/llvm/lib/Target/RISCV/RISCV.td
+++ b/llvm/lib/Target/RISCV/RISCV.td
@@ -57,6 +57,7 @@ include "RISCVSchedSyntacoreSCR345.td"
include "RISCVSchedSyntacoreSCR7.td"
include "RISCVSchedTTAscalonD8.td"
include "RISCVSchedXiangShanNanHu.td"
+include "RISCVSchedSpacemitX60.td"
//===----------------------------------------------------------------------===//
// RISC-V processors supported.
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index 9d48adeec5e86..6e44518cb43f2 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -559,7 +559,7 @@ def XIANGSHAN_NANHU : RISCVProcessorModel<"xiangshan-nanhu",
TuneShiftedZExtWFusion]>;
def SPACEMIT_X60 : RISCVProcessorModel<"spacemit-x60",
- NoSchedModel,
+ SpacemitX60Model,
!listconcat(RVA22S64Features,
[FeatureStdExtV,
FeatureStdExtSscofpmf,
diff --git a/llvm/lib/Target/RISCV/RISCVSchedSpacemitX60.td b/llvm/lib/Target/RISCV/RISCVSchedSpacemitX60.td
new file mode 100644
index 0000000000000..d1148cc2f69dc
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVSchedSpacemitX60.td
@@ -0,0 +1,332 @@
+//=- RISCVSchedSpacemitX60.td - Spacemit X60 Scheduling Defs -*- tablegen -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+//===----------------------------------------------------------------------===//
+//
+// Scheduler model for the SpacemiT-X60 processor based on documentation of the
+// C908 and experiments on real hardware (bpi-f3).
+//
+//===----------------------------------------------------------------------===//
+
+def SpacemitX60Model : SchedMachineModel {
+ let IssueWidth = 2; // dual-issue
+ let MicroOpBufferSize = 0; // in-order
+ let LoadLatency = 5; // worse case: >= 3
+ let MispredictPenalty = 9; // nine-stage
+
+ let CompleteModel = 0;
+
+ let UnsupportedFeatures = [HasStdExtZknd, HasStdExtZkne, HasStdExtZknh,
+ HasStdExtZksed, HasStdExtZksh, HasStdExtZkr];
+}
+
+let SchedModel = SpacemitX60Model in {
+
+//===----------------------------------------------------------------------===//
+// Define processor resources for Spacemit-X60
+
+// Information gathered from the C908 user manual:
+let BufferSize = 0 in {
+ // The LSU supports dual issue for scalar store/load instructions
+ def SMX60_LS : ProcResource<2>;
+
+ // An IEU can decode and issue two instructions at the same time
+ def SMX60_IEU : ProcResource<2>;
+
+ def SMX60_FP : ProcResource<1>;
+}
+
+//===----------------------------------------------------------------------===//
+
+// Branching
+def : WriteRes<WriteJmp, [SMX60_IEU]>;
+def : WriteRes<WriteJal, [SMX60_IEU]>;
+def : WriteRes<WriteJalr, [SMX60_IEU]>;
+
+// Integer arithmetic and logic
+def : WriteRes<WriteIALU32, [SMX60_IEU]>;
+def : WriteRes<WriteIALU, [SMX60_IEU]>;
+def : WriteRes<WriteShiftImm32, [SMX60_IEU]>;
+def : WriteRes<WriteShiftImm, [SMX60_IEU]>;
+def : WriteRes<WriteShiftReg32, [SMX60_IEU]>;
+def : WriteRes<WriteShiftReg, [SMX60_IEU]>;
+
+// Integer multiplication
+let Latency = 4 in {
+ def : WriteRes<WriteIMul, [SMX60_IEU]>;
+ def : WriteRes<WriteIMul32, [SMX60_IEU]>;
+}
+
+// Integer division/remainder
+// Worst case latency is used.
+def : WriteRes<WriteIDiv32, [SMX60_IEU]> { let Latency = 12; }
+def : WriteRes<WriteIDiv, [SMX60_IEU]> { let Latency = 20; }
+def : WriteRes<WriteIRem32, [SMX60_IEU]> { let Latency = 12; }
+def : WriteRes<WriteIRem, [SMX60_IEU]> { let Latency = 20; }
+
+// Bitmanip
+def : WriteRes<WriteRotateImm, [SMX60_IEU]>;
+def : WriteRes<WriteRotateImm32, [SMX60_IEU]>;
+def : WriteRes<WriteRotateReg, [SMX60_IEU]>;
+def : WriteRes<WriteRotateReg32, [SMX60_IEU]>;
+
+def : WriteRes<WriteCLZ, [SMX60_IEU]>;
+def : WriteRes<WriteCLZ32, [SMX60_IEU]>;
+def : WriteRes<WriteCTZ, [SMX60_IEU]>;
+def : WriteRes<WriteCTZ32, [SMX60_IEU]>;
+
+def : WriteRes<WriteCPOP, [SMX60_IEU]>;
+def : WriteRes<WriteCPOP32, [SMX60_IEU]>;
+
+def : WriteRes<WriteORCB, [SMX60_IEU]>;
+
+def : WriteRes<WriteIMinMax, [SMX60_IEU]>;
+
+def : WriteRes<WriteREV8, [SMX60_IEU]>;
+
+def : WriteRes<WriteSHXADD, [SMX60_IEU]>;
+def : WriteRes<WriteSHXADD32, [SMX60_IEU]>;
+
+// Single-bit instructions
+def : WriteRes<WriteSingleBit, [SMX60_IEU]>;
+def : WriteRes<WriteSingleBitImm, [SMX60_IEU]>;
+def : WriteRes<WriteBEXT, [SMX60_IEU]>;
+def : WriteRes<WriteBEXTI, [SMX60_IEU]>;
+
+// Memory/Atomic memory
+let Latency = 3 in {
+ def : WriteRes<WriteSTB, [SMX60_LS]>;
+ def : WriteRes<WriteSTH, [SMX60_LS]>;
+ def : WriteRes<WriteSTW, [SMX60_LS]>;
+ def : WriteRes<WriteSTD, [SMX60_LS]>;
+ def : WriteRes<WriteFST16, [SMX60_LS]>;
+ def : WriteRes<WriteFST32, [SMX60_LS]>;
+ def : WriteRes<WriteFST64, [SMX60_LS]>;
+ def : WriteRes<WriteAtomicSTW, [SMX60_LS]>;
+ def : WriteRes<WriteAtomicSTD, [SMX60_LS]>;
+}
+
+let Latency = 5 in {
+ def : WriteRes<WriteLDB, [SMX60_LS]>;
+ def : WriteRes<WriteLDH, [SMX60_LS]>;
+ def : WriteRes<WriteLDW, [SMX60_LS]>;
+ def : WriteRes<WriteLDD, [SMX60_LS]>;
+ def : WriteRes<WriteFLD16, [SMX60_LS]>;
+ def : WriteRes<WriteFLD32, [SMX60_LS]>;
+ def : WriteRes<WriteFLD64, [SMX60_LS]>;
+}
+
+// Atomics
+let Latency = 5 in {
+ def : WriteRes<WriteAtomicLDW, [SMX60_LS]>;
+ def : WriteRes<WriteAtomicLDD, [SMX60_LS]>;
+ def : WriteRes<WriteAtomicW, [SMX60_LS]>;
+ def : WriteRes<WriteAtomicD, [SMX60_LS]>;
+}
+
+// Floating point units Half precision
+def : WriteRes<WriteFAdd16, [SMX60_FP]> { let Latency = 3; }
+def : WriteRes<WriteFMul16, [SMX60_FP]> { let Latency = 3; }
+def : WriteRes<WriteFMA16, [SMX60_FP]> { let Latency = 4; }
+def : WriteRes<WriteFSGNJ16, [SMX60_FP]> { let Latency = 3; }
+def : WriteRes<WriteFMinMax16, [SMX60_FP]> { let Latency = 3; }
+
+// Worst case latency is used
+let Latency = 7, ReleaseAtCycles = [7] in {
+ def : WriteRes<WriteFDiv16, [SMX60_FP]>;
+ def : WriteRes<WriteFSqrt16, [SMX60_FP]>;
+}
+
+// Single precision
+def : WriteRes<WriteFAdd32, [SMX60_FP]> { let Latency = 3; }
+def : WriteRes<WriteFMul32, [SMX60_FP]> { let Latency = 4; }
+def : WriteRes<WriteFMA32, [SMX60_FP]> { let Latency = 5; }
+def : WriteRes<WriteFSGNJ32, [SMX60_FP]> { let Latency = 3; }
+def : WriteRes<WriteFMinMax32, [SMX60_FP]> { let Latency = 3; }
+
+// Worst case latency is used
+let Latency = 10, ReleaseAtCycles = [10] in {
+ def : WriteRes<WriteFDiv32, [SMX60_FP]>;
+ def : WriteRes<WriteFSqrt32, [SMX60_FP]>;
+}
+
+// Double precision
+def : WriteRes<WriteFAdd64, [SMX60_FP]> { let Latency = 4; }
+def : WriteRes<WriteFMul64, [SMX60_FP]> { let Latency = 4; }
+def : WriteRes<WriteFMA64, [SMX60_FP]> { let Latency = 5; }
+def : WriteRes<WriteFSGNJ64, [SMX60_FP]> { let Latency = 3; }
+def : WriteRes<WriteFMinMax64, [SMX60_FP]> { let Latency = 3; }
+
+let Latency = 10, ReleaseAtCycles = [10] in {
+ def : WriteRes<WriteFDiv64, [SMX60_FP]>;
+ def : WriteRes<WriteFSqrt64, [SMX60_FP]>;
+}
+
+// Conversions
+let Latency = 3 in {
+ def : WriteRes<WriteFCvtI32ToF16, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtI32ToF32, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtI32ToF64, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtI64ToF16, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtI64ToF32, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtI64ToF64, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF16ToI32, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF16ToI64, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF16ToF32, [SMX60_FP]>;
+ def : WriteRes<WriteFCvtF16ToF64, [SMX60_FP]>;
+ def : WriteRes<WriteFCvtF32ToI32, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF32ToI64, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF32ToF16, [SMX60_FP]>;
+ def : WriteRes<WriteFCvtF32ToF64, [SMX60_FP]>;
+ def : WriteRes<WriteFCvtF64ToI32, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF64ToI64, [SMX60_IEU]>;
+ def : WriteRes<WriteFCvtF64ToF16, [SMX60_FP]>;
+ def : WriteRes<WriteFCvtF64ToF32, [SMX60_FP]>;
+}
+
+let Latency = 2 in {
+ def : WriteRes<WriteFClass16, [SMX60_FP]>;
+ def : WriteRes<WriteFClass32, [SMX60_FP]>;
+ def : WriteRes<WriteFClass64, [SMX60_FP]>;
+}
+
+let Latency = 4 in {
+ def : WriteRes<WriteFCmp16, [SMX60_FP]>;
+ def : WriteRes<WriteFCmp32, [SMX60_FP]>;
+ def : WriteRes<WriteFCmp64, [SMX60_FP]>;
+}
+
+let Latency = 2 in {
+ def : WriteRes<WriteFMovI16ToF16, [SMX60_IEU]>;
+ def : WriteRes<WriteFMovF16ToI16, [SMX60_IEU]>;
+ def : WriteRes<WriteFMovI32ToF32, [SMX60_IEU]>;
+ def : WriteRes<WriteFMovF32ToI32, [SMX60_IEU]>;
+ def : WriteRes<WriteFMovI64ToF64, [SMX60_IEU]>;
+ def : WriteRes<WriteFMovF64ToI64, [SMX60_IEU]>;
+}
+
+// Others
+def : WriteRes<WriteCSR, [SMX60_IEU]>;
+def : WriteRes<WriteNop, [SMX60_IEU]>;
+
+//===----------------------------------------------------------------------===//
+// Bypass and advance
+def : ReadAdvance<ReadJmp, 0>;
+def : ReadAdvance<ReadJalr, 0>;
+def : ReadAdvance<ReadCSR, 0>;
+def : ReadAdvance<ReadStoreData, 0>;
+def : ReadAdvance<ReadMemBase, 0>;
+def : ReadAdvance<ReadIALU, 0>;
+def : ReadAdvance<ReadIALU32, 0>;
+def : ReadAdvance<ReadShiftImm, 0>;
+def : ReadAdvance<ReadShiftImm32, 0>;
+def : ReadAdvance<ReadShiftReg, 0>;
+def : ReadAdvance<ReadShiftReg32, 0>;
+def : ReadAdvance<ReadIDiv, 0>;
+def : ReadAdvance<ReadIDiv32, 0>;
+def : ReadAdvance<ReadIRem, 0>;
+def : ReadAdvance<ReadIRem32, 0>;
+def : ReadAdvance<ReadIMul, 0>;
+def : ReadAdvance<ReadIMul32, 0>;
+def : ReadAdvance<ReadAtomicWA, 0>;
+def : ReadAdvance<ReadAtomicWD, 0>;
+def : ReadAdvance<ReadAtomicDA, 0>;
+def : ReadAdvance<ReadAtomicDD, 0>;
+def : ReadAdvance<ReadAtomicLDW, 0>;
+def : ReadAdvance<ReadAtomicLDD, 0>;
+def : ReadAdvance<ReadAtomicSTW, 0>;
+def : ReadAdvance<ReadAtomicSTD, 0>;
+def : ReadAdvance<ReadFStoreData, 0>;
+def : ReadAdvance<ReadFMemBase, 0>;
+def : ReadAdvance<ReadFAdd16, 0>;
+def : ReadAdvance<ReadFAdd32, 0>;
+def : ReadAdvance<ReadFAdd64, 0>;
+def : ReadAdvance<ReadFMul16, 0>;
+def : ReadAdvance<ReadFMA16, 0>;
+def : ReadAdvance<ReadFMA16Addend, 0>;
+def : ReadAdvance<ReadFMul32, 0>;
+def : ReadAdvance<ReadFMul64, 0>;
+def : ReadAdvance<ReadFMA32, 0>;
+def : ReadAdvance<ReadFMA32Addend, 0>;
+def : ReadAdvance<ReadFMA64, 0>;
+def : ReadAdvance<ReadFMA64Addend, 0>;
+def : ReadAdvance<ReadFDiv16, 0>;
+def : ReadAdvance<ReadFDiv32, 0>;
+def : ReadAdvance<ReadFDiv64, 0>;
+def : ReadAdvance<ReadFSqrt16, 0>;
+def : ReadAdvance<ReadFSqrt32, 0>;
+def : ReadAdvance<ReadFSqrt64, 0>;
+def : ReadAdvance<ReadFCmp16, 0>;
+def : ReadAdvance<ReadFCmp32, 0>;
+def : ReadAdvance<ReadFCmp64, 0>;
+def : ReadAdvance<ReadFSGNJ16, 0>;
+def : ReadAdvance<ReadFSGNJ32, 0>;
+def : ReadAdvance<ReadFSGNJ64, 0>;
+def : ReadAdvance<ReadFMinMax16, 0>;
+def : ReadAdvance<ReadFMinMax32, 0>;
+def : ReadAdvance<ReadFMinMax64, 0>;
+def : ReadAdvance<ReadFCvtF16ToI32, 0>;
+def : ReadAdvance<ReadFCvtF16ToI64, 0>;
+def : ReadAdvance<ReadFCvtF32ToI32, 0>;
+def : ReadAdvance<ReadFCvtF32ToI64, 0>;
+def : ReadAdvance<ReadFCvtF64ToI32, 0>;
+def : ReadAdvance<ReadFCvtF64ToI64, 0>;
+def : ReadAdvance<ReadFCvtI32ToF16, 0>;
+def : ReadAdvance<ReadFCvtI32ToF32, 0>;
+def : ReadAdvance<ReadFCvtI32ToF64, 0>;
+def : ReadAdvance<ReadFCvtI64ToF16, 0>;
+def : ReadAdvance<ReadFCvtI64ToF32, 0>;
+def : ReadAdvance<ReadFCvtI64ToF64, 0>;
+def : ReadAdvance<ReadFCvtF32ToF64, 0>;
+def : ReadAdvance<ReadFCvtF64ToF32, 0>;
+def : ReadAdvance<ReadFCvtF16ToF32, 0>;
+def : ReadAdvance<ReadFCvtF32ToF16, 0>;
+def : ReadAdvance<ReadFCvtF16ToF64, 0>;
+def : ReadAdvance<ReadFCvtF64ToF16, 0>;
+def : ReadAdvance<ReadFMovF16ToI16, 0>;
+def : ReadAdvance<ReadFMovI16ToF16, 0>;
+def : ReadAdvance<ReadFMovF32ToI32, 0>;
+def : ReadAdvance<ReadFMovI32ToF32, 0>;
+def : ReadAdvance<ReadFMovF64ToI64, 0>;
+def : ReadAdvance<ReadFMovI64ToF64, 0>;
+def : ReadAdvance<ReadFClass16, 0>;
+def : ReadAdvance<ReadFClass32, 0>;
+def : ReadAdvance<ReadFClass64, 0>;
+
+// Bitmanip
+def : ReadAdvance<ReadRotateImm, 0>;
+def : ReadAdvance<ReadRotateImm32, 0>;
+def : ReadAdvance<ReadRotateReg, 0>;
+def : ReadAdvance<ReadRotateReg32, 0>;
+def : ReadAdvance<ReadCLZ, 0>;
+def : ReadAdvance<ReadCLZ32, 0>;
+def : ReadAdvance<ReadCTZ, 0>;
+def : ReadAdvance<ReadCTZ32, 0>;
+def : ReadAdvance<ReadCPOP, 0>;
+def : ReadAdvance<ReadCPOP32, 0>;
+def : ReadAdvance<ReadORCB, 0>;
+def : ReadAdvance<ReadIMinMax, 0>;
+def : ReadAdvance<ReadREV8, 0>;
+def : ReadAdvance<ReadSHXADD, 0>;
+def : ReadAdvance<ReadSHXADD32, 0>;
+// Single-bit instructions
+def : ReadAdvance<ReadSingleBit, 0>;
+def : ReadAdvance<ReadSingleBitImm, 0>;
+
+//===----------------------------------------------------------------------===//
+// Unsupported extensions
+defm : UnsupportedSchedV;
+defm : UnsupportedSchedXsfvcp;
+defm : UnsupportedSchedZabha;
+defm : UnsupportedSchedZbc;
+defm : UnsupportedSchedZbkb;
+defm : UnsupportedSchedZbkx;
+defm : UnsupportedSchedZfa;
+defm : UnsupportedSchedZvk;
+defm : UnsupportedSchedSFB;
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll b/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll
index 75f4b977a98b0..b384a0187a1ce 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll
@@ -302,32 +302,32 @@ define void @test1(ptr nocapture noundef writeonly %dst, i32 noundef signext %i_
; RV64X60-NEXT: .cfi_offset s4, -40
; RV64X60-NEXT: li t0, 0
; RV64X60-NEXT: li t1, 0
-; RV64X60-NEXT: addi t2, a7, -1
-; RV64X60-NEXT: add t4, a0, a6
-; RV64X60-NEXT: add t5, a2, a6
-; RV64X60-NEXT: add t3, a4, a6
-; RV64X60-NEXT: zext.w s0, t2
-; RV64X60-NEXT: mul s1, a1, s0
-; RV64X60-NEXT: add t4, t4, s1
-; RV64X60-NEXT: mul s1, a3, s0
-; RV64X60-NEXT: add t5, t5, s1
+; RV64X60-NEXT: addi s1, a7, -1
+; RV64X60-NEXT: zext.w s1, s1
+; RV64X60-NEXT: mul t2, a1, s1
+; RV64X60-NEXT: mul t3, a3, s1
+; RV64X60-NEXT: mul t4, a5, s1
+; RV64X60-NEXT: add s1, a0, a6
+; RV64X60-NEXT: add s0, a2, a6
+; RV64X60-NEXT: add t5, a4, a6
+; RV64X60-NEXT: add s2, s1, t2
; RV64X60-NEXT: csrr t2, vlenb
-; RV64X60-NEXT: mul s1, a5, s0
-; RV64X60-NEXT: add t3, t3, s1
-; RV64X60-NEXT: sltu s1, a0, t5
-; RV64X60-NEXT: sltu s0, a2, t4
-; RV64X60-NEXT: and t6, s1, s0
+; RV64X60-NEXT: add t3, t3, s0
+; RV64X60-NEXT: or t6, a1, a3
+; RV64X60-NEXT: add t4, t4, t5
+; RV64X60-NEXT: sltu s0, a0, t3
+; RV64X60-NEXT: sltu s1, a2, s2
+; RV64X60-NEXT: and t5, s0, s1
+; RV64X60-NEXT: slli t3, t2, 1
+; RV64X60-NEXT: slti s1, t6, 0
+; RV64X60-NEXT: sltu s0, a0, t4
+; RV64X60-NEXT: or t4, t5, s1
+; RV64X60-NEXT: sltu s1, a4, s2
+; RV64X60-NEXT: and s0, s0, s1
+; RV64X60-NEXT: or s1, a1, a5
; RV64X60-NEXT: li t5, 32
-; RV64X60-NEXT: sltu s1, a0, t3
-; RV64X60-NEXT: sltu s0, a4, t4
-; RV64X60-NEXT: and t3, s1, s0
-; RV64X60-NEXT: or s1, a1, a3
; RV64X60-NEXT: slti s1, s1, 0
-; RV64X60-NEXT: or t4, t6, s1
-; RV64X60-NEXT: or s0, a1, a5
-; RV64X60-NEXT: slti s0, s0, 0
-; RV64X60-NEXT: or s0, t3, s0
-; RV64X60-NEXT: slli t3, t2, 1
+; RV64X60-NEXT: or s0, s0, s1
; RV64X60-NEXT: maxu s1, t3, t5
; RV64X60-NEXT: or s0, t4, s0
; RV64X60-NEXT: sltu s1, a6, s1
@@ -339,8 +339,8 @@ define void @test1(ptr nocapture noundef writeonly %dst, i32 noundef signext %i_
; RV64X60-NEXT: # in Loop: Header=BB0_4 Depth=1
; RV64X60-NEXT: add t5, t5, a1
; RV64X60-NEXT: add a2, a2, a3
-; RV64X60-NEXT: add a4, a4, a5
; RV64X60-NEXT: addiw t1, t1, 1
+; RV64X60-NEXT: add a4, a4, a5
; RV64X60-NEXT: addi t0, t0, 1
; RV64X60-NEXT: beq t1, a7, .LBB0_11
; RV64X60-NEXT: .LBB0_4: # %for.cond1.preheader.us
@@ -367,10 +367,10 @@ define void @test1(ptr nocapture noundef writeonly %dst, i32 noundef signext %i_
; RV64X60-NEXT: vl2r.v v8, (s2)
; RV64X60-NEXT: vl2r.v v10, (s3)
; RV64X60-NEXT: sub s1, s1, t3
-; RV64X60-NEXT: add s3, s3, t3
; RV64X60-NEXT: vaaddu.vv v8, v8, v10
; RV64X60-NEXT: vs2r.v v8, (s4)
; RV64X60-NEXT: add s4, s4, t3
+; RV64X60-NEXT: add s3, s3, t3
; RV64X60-NEXT: add s2, s2, t3
; RV64X60-NEXT: bnez s1, .LBB0_7
; RV64X60-NEXT: # %bb.8: # %middle.block
diff --git a/llvm/test/tools/llvm-mca/RISCV/SpacemitX60/atomic.s b/llvm/test/tools/llvm-mca/RISCV/SpacemitX60/atomic.s
new file mode 100644
index 0000000000000..73109a78cd4b9
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/RISCV/SpacemitX60/atomic.s
@@ -0,0 +1,312 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=riscv64 -mattr=+rva22u64 -mcpu=spacemit-x60 -iterations=1 < %s | FileCheck %s
+
+# Zalrsc
+lr.w t0, (t1)
+lr.w.aq t1, (t2)
+lr.w.rl t2, (t3)
+lr.w.aqrl t3, (t4)
+sc.w t6, t5, (t4)
+sc.w.aq t5, t4, (t3)
+sc.w.rl t4, t3, (t2)
+sc.w.aqrl t3, t2, (t1)
+
+lr.d t0, (t1)
+lr.d.aq t1, (t2)
+lr.d.rl t2, (t3)
+lr.d.aqrl t3, (t4)
+sc.d t6, t5, (t4)
+sc.d.aq t5, t4, (t3)
+sc.d.rl t4, t3, (t2)
+sc.d.aqrl t3, t2, (t1)
+
+# Zaamo
+amoswap.w a4, ra, (s0)
+amoadd.w a1, a2, (a3)
+amoxor.w a2, a3, (a4)
+amoand.w a3, a4, (a5)
+amoor.w a4, a5, (a6)
+amomin.w a5, a6, (a7)
+amomax.w s7, s6, (s5)
+amominu.w s6, s5, (s4)
+amomaxu.w s5, s4, (s3)
+
+amoswap.w.aq a4, ra, (s0)
+amoadd.w.aq a1, a2, (a3)
+amoxor.w.aq a2, a3, (a4)
+amoand.w.aq a3, a4, (a5)
+amoor.w.aq a4, a5, (a6)
+amomin.w.aq a5, a6, (a7)
+amomax.w.aq s7, s6, (s5)
+amominu.w.aq s6, s5, (s4)
+amomaxu.w.aq s5, s4, (s3)
+
+amoswap.w.rl a4, ra, (s0)
+amoadd.w.rl a1, a2, (a3)
+amoxor.w.rl a2, a3, (a4)
+amoand.w.rl a3, a4, (a5)
+amoor.w.rl a4, a5, (a6)
+amomin.w.rl a5, a6, (a7)
+amomax.w.rl s7, s6, (s5)
+amominu.w.rl s6, s5, (s4)
+amomaxu.w.rl s5, s4, (s3)
+
+amoswap.w.aqrl a4, ra, (s0)
+amoadd.w.aqrl a1, a2, (a3)
+amoxor.w.aqrl a2, a3, (a4)
+amoand.w.aqrl a3, a4, (a5)
+amoor.w.aqrl a4, a5, (a6)
+amomin.w.aqrl a5, a6, (a7)
+amomax.w.aqrl s7, s6, (s5)
+amominu.w.aqrl s6, s5, (s4)
+amomaxu.w.aqrl s5, s4, (s3)
+
+amoswap.d a4, ra, (s0)
+amoadd.d a1, a2, (a3)
+amoxor.d a2, a3, (a4)
+amoand.d a3, a4, (a5)
+amoor.d a4, a5, (a6)
+amomin.d a5, a6, (a7)
+amomax.d s7, s6, (s5)
+amominu.d s6, s5, (s4)
+amomaxu.d s5, s4, (s3)
+
+amoswap.d.aq a4, ra, (s0)
+amoadd.d.aq a1, a2, (a3)
+amoxor.d.aq a2, a3, (a4)
+amoand.d.aq a3, a4, (a5)
+amoor.d.aq a4, a5, (a6)
+amomin.d.aq a5, a6, (a7)
+amomax.d.aq s7, s6, (s5)
+amominu.d.aq s6, s5, (s4)
+amomaxu.d.aq s5, s4, (s3)
+
+amoswap.d.rl a4, ra, (s0)
+amoadd.d.rl a1, a2, (a3)
+amoxor.d.rl a2, a3, (a4)
+amoand.d.rl a3, a4, (a5)
+amoor.d.rl a4, a5, (a6)
+amomin.d.rl a5, a6, (a7)
+amomax.d.rl s7, s6, (s5)
+amominu.d.rl s6, s5, (s4)
+amomaxu.d.rl s5, s4, (s3)
+
+amoswap.d.aqrl a4, ra, (s0)
+amoadd.d.aqrl a1, a2, (a3)
+amoxor.d.aqrl a2, a3, (a4)
+amoand.d.aqrl a3, a4, (a5)
+amoor.d.aqrl a4, a5, (a6)
+amomin.d.aqrl a5, a6, (a7)
+amomax.d.aqrl s7, s6, (s5)
+amominu.d.aqrl s6, s5, (s4)
+amomaxu.d.aqrl s5, s4, (s3)
+
+# CHECK: Iterations: 1
+# CHECK-NEXT: Instructions: 88
+# CHECK-NEXT: Total Cycles: 86
+# CHECK-NEXT: Total uOps: 88
+
+# CHECK: Dispatch Width: 2
+# CHECK-NEXT: uOps Per Cycle: 1.02
+# CHECK-NEXT: IPC: 1.02
+# CHECK-NEXT: Block RThroughput: 44.0
+
+# CHECK: Instruction Info:
+# CHECK-NEXT: [1]: #uOps
+# CHECK-NEXT: [2]: Latency
+# CHECK-NEXT: [3]: RThroughput
+# CHECK-NEXT: [4]: MayLoad
+# CHECK-NEXT: [5]: MayStore
+# CHECK-NEXT: [6]: HasSideEffects (U)
+
+# CHECK: [1] [2] [3] [4] [5] [6] Instructions:
+# CHECK-NEXT: 1 5 0.50 * lr.w t0, (t1)
+# CHECK-NEXT: 1 5 0.50 * lr.w.aq t1, (t2)
+# CHECK-NEXT: 1 5 0.50 * lr.w.rl t2, (t3)
+# CHECK-NEXT: 1 5 0.50 * lr.w.aqrl t3, (t4)
+# CHECK-NEXT: 1 3 0.50 * sc.w t6, t5, (t4)
+# CHECK-NEXT: 1 3 0.50 * sc.w.aq t5, t4, (t3)
+# CHECK-NEXT: 1 3 0.50 * sc.w.rl t4, t3, (t2)
+# CHECK-NEXT: 1 3 0.50 * s...
[truncated]
I've discussed earlier versions of this patch with Mikhail a fair bit so I'd appreciate a review from outside our org, but I will just share my thoughts on criteria for whether this makes sense to merge at this point: For me the key thing is that per Mikhail's benchmarking it's at the point where there are pretty much across-the-board improvements and, importantly, there's no evidence of regressions due to introducing scalar but not having vector scheduling.

As for the fine details of the model, we're not going to get the very high fidelity of the 7-series model without much more microarchitectural information or a lot more reverse engineering. What is here seems a reasonable starting point. I'd appreciate comments on anything that seems anomalous vs other scheduling models - we should aim to basically match the pattern of others unless there's data or documented information to differ.

With that in mind, the store latency of 3 is a slight oddity vs other similar models. But the A55 had latency=4 up to f73334c and the commit comment indicates it has very limited impact on scheduling anyway. So unless people feel strongly it's worth more investigation right now, I propose sticking with what Mikhail suggests. (Incidentally, should any of our models be setting RetireOOO like the A55 does?)
This looks generally reasonable, and I agree that for a black box schedule model the right threshold is to use observed performance.
I'm going to run through (in a second pass) the available data for this core, and cross check the model. Forthcoming shortly.
// An IEU can decode and issue two instructions at the same time
def SMX60_IEU : ProcResource<2>;

def SMX60_FP : ProcResource<1>;
Add a comment here including the bit from your review description about why dual issue isn't used here.
Some floating-point instructions can double issue, such as those using FALU and FMAU, but not, for example, FCVT.
Mikhail mentioned that a value of 1 would give better performance, so we can start with 1. We will continue to improve this model in the future.
Let's start with 1 here, and then see if we can split this in a follow up patch.
As a follow up (as in, not in this patch), it would be good to explore this further.
I just wonder why we didn't ask the Spacemit guys to provide the schedule model. They have a compiler team but are not so active in upstream.
Thanks Mikhail for bringing the initial schedule model support to x60. We will take a look at this patch and work with the upstream to improve the performance of x60.
def SpacemitX60Model : SchedMachineModel {
  let IssueWidth = 2; // dual-issue
  let MicroOpBufferSize = 0; // in-order
  let LoadLatency = 5; // worse case: >= 3
Load latency is 3 or 4 in the case of a cache hit, but since load=5 actually performs the best in tests, we can keep this until another configuration beats it in test performance.
As a follow up (as in, not in this patch), please run another sweep of this parameter with the final model, and post a follow on if it needs to be tweaked slightly.
  let IssueWidth = 2; // dual-issue
  let MicroOpBufferSize = 0; // in-order
  let LoadLatency = 5; // worse case: >= 3
  let MispredictPenalty = 9; // nine-stage
In the case of an L1 cache hit, the penalty is about 3-6 cycles.
However, we didn't test the performance impact of tuning this parameter. If a different value is better for the test results, then just use it :)
As a follow up (as in, not in this patch), please run another sweep of this parameter with the final model, and post a follow on if it needs to be tweaked slightly.
Folks, we wrote a probe to double-check all latencies in the scheduler and updated the latencies accordingly. We tested it on two different boards to confirm the numbers. The only outlier was idiv/irem, which was reported to have a latency of 3-4 in our experiments, so we went with the worst-case value shown in the C908 manual. I added a TODO to revisit this later. I also included the latencies for clmul, clmulr, and clmulh, which were missing from the first version of this PR.
I think this is reasonable for integer divisions.
Looks like this is converging with the feedback from @zqb-all (Thanks!). Minor comment only.
// An IEU can decode and issue two instructions at the same time
def SMX60_IEU : ProcResource<2>;

def SMX60_FP : ProcResource<1>;
Let's start with 1 here, and then see if we can split this in a follow up patch.
LGTM
def SpacemitX60Model : SchedMachineModel {
  let IssueWidth = 2; // dual-issue
  let MicroOpBufferSize = 0; // in-order
  let LoadLatency = 5; // worse case: >= 3
As a follow up (as in, not in this patch), please run another sweep of this parameter with the final model, and post a follow on if it needs to be tweaked slightly.
  let IssueWidth = 2; // dual-issue
  let MicroOpBufferSize = 0; // in-order
  let LoadLatency = 5; // worse case: >= 3
  let MispredictPenalty = 9; // nine-stage
As a follow up (as in, not in this patch), please run another sweep of this parameter with the final model, and post a follow on if it needs to be tweaked slightly.
// An IEU can decode and issue two instructions at the same time
def SMX60_IEU : ProcResource<2>;

def SMX60_FP : ProcResource<1>;
As a follow up (as in, not in this patch), it would be good to explore this further.
def : WriteRes<WriteIMinMax, [SMX60_IEU]>;
def : WriteRes<WriteREV8, [SMX60_IEU]>;

let Latency = 2 in {
As a follow up (as in, not in this patch), it would be interesting to explore if this is actually two cycle latency, or if this is micro-coded as two uops, each with latency one. You could maybe see this in perf counters.
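For reference, one way to tell these cases apart without uop-level performance counters is to contrast a serially dependent sequence with an independent one: in a dependency chain, a single 2-cycle uop and two chained 1-cycle uops look identical, but an independent stream exposes how many issue slots each instruction consumes. The snippet below is only an illustrative sketch of that idea (labels, register choices, and the timing harness are assumptions, not part of this patch):

# Hypothetical sketch: separating latency from issue cost for fmv.w.x/fmv.x.w.
# Time each block in a counted loop with rdcycle (see the probe sketch further
# below) and compare cycles per instruction.

# (a) Dependent round trip: measures end-to-end latency of the pair.
#     One 2-cycle uop per move and two chained 1-cycle uops both give ~4
#     cycles per iteration, so this alone cannot distinguish them.
dep_chain:
    fmv.w.x fa0, a0
    fmv.x.w a0, fa0

# (b) Independent stream: measures issue bandwidth. Single-uop moves should
#     approach IssueWidth per cycle; a 2-uop encoding would roughly halve that.
indep_stream:
    fmv.w.x fa0, a0
    fmv.w.x fa1, a0
    fmv.w.x fa2, a0
    fmv.w.x fa3, a0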
Hi @mikhailramalho, is "probe" referring to "llvm-exegesis", or does it refer to another independent program?
From context in offline discussion, this was an ad-hoc mix of llvm-exegesis where it seemed to produce sane results, and custom assembly snippets. There's definitely room for error here; this type of micro-architectural exploration is hard and error prone. I approved this mostly based on the net perf results, not any expectation that every number for every instruction was correct. I'd encourage you to make suggestions for improvements. Ideally in the form of pull requests, but if you want to drop comments here, Mikhail or I can follow up.
Thanks, I don't mean to question any particular number in this configuration; this patch is good. I just want to learn this way of probing, so that it will also be helpful when improving the configuration in the future.
Hi @zqb-all, it's a custom probing tool that we plan to share soon.
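The tool itself has not been published yet, but the usual shape of such a probe is a long chain of serially dependent instructions timed with rdcycle, with total cycles divided by the chain length. The sketch below is purely illustrative and is not the SpacemiT tool; the symbol name, chain length, instruction under test, and the assumption that the cycle CSR is readable from user mode are all editorial:

# Minimal sketch of a dependent-chain latency probe (illustrative only).
# Assumes rdcycle is accessible and the D extension is available.
    .text
    .globl  probe_fadd_d_latency
probe_fadd_d_latency:
    li      t0, 1048576          # loop iterations; 4 chained ops per iteration
    rdcycle t1                   # start cycle count
1:
    fadd.d  fa0, fa0, fa1        # each fadd.d waits on the previous result
    fadd.d  fa0, fa0, fa1
    fadd.d  fa0, fa0, fa1
    fadd.d  fa0, fa0, fa1
    addi    t0, t0, -1
    bnez    t0, 1b
    rdcycle t2                   # end cycle count
    sub     a0, t2, t1           # total cycles; dividing by 4 * 1048576
    ret                          # gives cycles per fadd.d, i.e. its latency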
This patch adds an initial scheduler model for the SpacemiT-X60, including latency for scalar instructions only.

The scheduler is based on the documented characteristics of the C908, which the SpacemiT-X60 is believed to be based on, and provides the expected latency for several instructions. I ran a probe to confirm all of these values and to get the latency of instructions not provided by the C908 documentation (e.g., double floating-point instructions).

For load and store instructions, the C908 documentation says the latency is >= 3 for load and 1 for store. I tried a few combinations of values until I got the current values of 5 and 3, which yield the best results.

Although the X60 does appear to support multiple issue for at least some floating point instructions, this model assumes single issue as increasing it reduces the gains below.

This patch gives a geomean improvement of ~4% on SPEC CPU 2017 for both rva22u64 and rva22u64_v, with some benchmarks improving up to 18% (508.namd_r). There were a couple of execution time regressions, but only in noisy benchmarks (523.xalancbmk_r and 510.parest_r).

* rva22u64: https://lnt.lukelau.me/db_default/v4/nts/507?compare_to=405 (compares a55f727 to the baseline 8286b80)
* rva22u64_v: https://lnt.lukelau.me/db_default/v4/nts/474?compare_to=404 (compares a55f727 to the baseline 8286b80)

This initial scheduling model is strongly focused on providing sufficient definitions to provide improved performance for the SpacemiT-X60. Further incremental gains may be possible through a much more detailed microarchitectural analysis, but that is left to future work.

Further scheduling definitions for RVV can be added in a future PR.
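As a quick way to see what the new model reports, the in-tree llvm-mca tests added by this patch can be mirrored on an arbitrary instruction mix; the snippet below is only an example (the instructions are chosen for illustration), with the expected latencies taken directly from RISCVSchedSpacemitX60.td:

# RUN: llvm-mca -mtriple=riscv64 -mattr=+rva22u64 -mcpu=spacemit-x60 -iterations=1 < %s
fadd.d  fa0, fa1, fa2      # WriteFAdd64: latency 4
fmul.s  fa3, fa4, fa5      # WriteFMul32: latency 4
fdiv.s  fa6, fa7, ft0      # WriteFDiv32: latency 10, occupies the FP unit for 10 cycles
lw      a0, 0(a1)          # WriteLDW: latency 5
sw      a0, 0(a1)          # WriteSTW: latency 3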