T1 (Torrent-1), named after T0, is a RISC-V Vector implementation inspired by the Cray X1 vector machine.
T1 aims to implement the RISC-V Vector in lane-based micro-architectures, with intensive chaining support and SRAM-based VRFs.
T1 supports standard Zve32f and Zve32x, and VLEN/DLEN can be increased up to 64K, reaching the limit of the RISC-V Vector architecture.
T1 ships important vector machine features by default, e.g., lanes, chaining, and a large number of outstanding LSU transactions, but it can also serve as a general platform for MMIO DSAs (Domain-Specific Accelerators).
T1 is designed in Chisel and releases the T1Emulator to users.
T1 uses a forked version of the Rocket Core as its scalar part, but we don't officially support it for now; it can be replaced by any other RISC-V scalar CPU.
T1 only supports bare-metal program loading and execution; test examples can be found in the tests/ folder.
The generated T1 vector processors can integrate with any RISC-V scalar core.
- Default support for multiple lanes (32 bits per lane).
- Load to Multiple-Exec to Store to Load chaining capability.
- RAM-based configurable banked SRAM with DualPort, TwoPort, and SinglePort support.
- Pipelined/Asynchronous Vector Function Unit (VFU) with comprehensive chaining support. Each lane allocates 4 VFU slots, and multiple different VFUs can be attached to the corresponding lane.
- For masked instructions, T1 lane execution can skip elements that are all masked off, which accommodates sparse masks.
- We use a directly connected lane interconnect for widen and narrow instructions.
- Configurable banked memory port.
- Instruction-level Out-of-Order (OoO) load/store, leveraging the high memory bandwidth of the vector cache.
- Configurable outstanding size to mitigate memory latency.
- Fully chained to the Vector Function Unit (VFU).
Compared to some commercial Out-of-Order core designs with advanced speculation schemes, the architecture of the vector machine is relatively straightforward. T1 does not dedicate area to a Branch Prediction Unit (BPU), Rename and Reorder Buffer (ROB), or prefetching; instead, vector instructions provide enough metadata to allow T1 to run for thousands of elements without requiring a speculation scheme.
T1 is designed to balance the throughput, area, and frequency among the VRF, VFU, and LSU. The T1 generator can be easily configured for either high efficiency or high performance, depending on the desired trade-offs, including adding function units or removing the FPU, which drops Zve32f support and leaves Zve32x only.
The methodology for the micro-architecture tuning is based on this trade-off idea:
The overall vector core frequency should be limited by the VRF memory. Based on this principle, we could retime the VFU pipeline to multiple stages to meet the frequency target. For a small, highly efficient core, designers should choose high-density memory (which usually doesn’t offer high frequency) and reduce the VFU pipeline stages. For a high-performance core, they should increase the pipeline stages and use the fastest possible SRAM for the VRFs.
The bandwidth bottleneck is the VRF SRAM. Each operating VFU might encounter hazards due to the limited VRF memory ports. Users can increase the number of VRF banks, but a banked VRF forces an all-to-all crossbar between the VFU and VRF banks, which has a heavy impact on the physical design. Users should trade off the Exec and VRF bandwidth by limiting the connections between Execution and VRFs.
The bandwidth of the LSU is limited by the memory ports: The LSU is also configurable to provide very high memory bandwidth with a small overhead. It imposes these requirements on the bus:
- Requiring FIFO (first-in-first-out) ordering on the bus. If FIFO ordering is not implemented in the bus IP, a large reorder unit is needed because of the extremely large number of outstanding sourceId values in TileLink, like AWID, ARID, WID, RID, BID in the AXI protocol.
- Requiring no MMU for high-bandwidth ports, since we may query DLEN/32 elements from the TLB each cycle in indexed load/store mode, which could leave an unreasonable number of page faults outstanding. For now, these features are not supported in the current Rocket Core.
- No coherence support: no high-performance cache can bear T1’s DLEN/32 queries.
The key point of T1 LSU is that it is designed to support multiple memory banks. Each memory bank has 3 MSHRs for outstanding memory instructions, while every instruction can record thousands of transaction states in the FIFO order. T1 also supports instruction-level interleaved vector load/store to maximize the use of memory ports for high memory bandwidth.
For tuning the ideal vector machines, follow these performance-tuning methodologies:
- Determine DLEN for your parallelism requirement, i.e., the required bandwidth for the vector unit.
- Match the bandwidth of the VRF, VFU, and LSU.
- Based on your workload, determine the required VLEN as it dictates the VRF memory area.
- Choose the memory type for the VRF, which will determine the chip frequency.
- Run the T1Emulator and PnR for your workloads to tune micro-architecture.
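The sizing arithmetic behind the first steps above can be sketched in shell. This is a back-of-the-envelope aid, not part of the T1 tooling: the function name t1_sizing is hypothetical, the lane count assumes the 32-bit lanes listed earlier (lanes = DLEN / 32), and the VRF capacity assumes the 32 architectural vector registers of the RISC-V Vector extension, each VLEN bits wide.

```shell
#!/bin/sh
# Hypothetical helper (not part of T1): estimate lane count and VRF capacity.
# Assumptions: each lane is 32 bits wide, and the VRF holds the 32
# architectural RVV registers, each VLEN bits wide.
t1_sizing() {
  dlen=$1   # datapath width in bits
  vlen=$2   # vector register length in bits
  lanes=$((dlen / 32))
  vrf_bytes=$((32 * vlen / 8))
  echo "lanes=$lanes vrf_bytes=$vrf_bytes"
}

# Example: a configuration with DLEN 256 and VLEN 512.
t1_sizing 256 512
```

The real area and bandwidth also depend on the VRF banking, port type, and LSU settings, which this estimate ignores.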
We have an IP emulator under the directory ./t1emu. Spike is used as the reference scalar core, integrated with the verilated vector IP. Under the online differential-test strategy, the emulator compares the load/store and VRF writes between Spike and T1 to verify T1’s correctness.
docker pull ghcr.io/chipsalliance/t1-$config:latest
# For example, config with dlen 256 vlen 512 support
docker pull ghcr.io/chipsalliance/t1-blastoise:latest
Or build the image using nix and load it into docker:
nix build -L ".#t1.$config.release.docker-image" --out-link docker-image.tar.gz
docker load -i ./docker-image.tar.gz
Using nix to build the docker image requires the KVM feature, so this derivation might not be available on platforms without QEMU/KVM support.
We use Nix Flake as our primary build system. If you have not installed nix, install it following the guide, and enable flake following the wiki. Or you can try the installer provided by Determinate Systems, which enables flake by default.
T1 includes a hardware design written in Chisel and an emulator powered by Verilator. The elaborator and emulator can be run with various configurations. Configurations can be represented by your favorite Pokemon! The only limitation is that T1 uses the Pokemon type to determine DLEN, aka lane size, based on the corresponding map:
| Type | DLEN |
|---|---|
| Grass | 32 |
| Fire | 64 |
| Flying | 128 |
| Water | 256 |
| Fighting | 512 |
| Electric | 1K |
| Ground | 1K |
| Psychic | 2K |
| Dark | 4K |
| Ice | 8K |
| Fairy | 16K |
| Ghost | 32K |
| Dragon | 64K |
Note
The Bug type is reserved for users to submit bug reports.
Users can add their own pokemon to configgen/src/Main.scala to add configurations with different variations.
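The type-to-DLEN table above can be expressed as a small lookup. This is only an illustrative sketch, not the logic in configgen: the function name dlen_for_type is hypothetical, and it assumes that 1K in the table means 1024 bits. The reserved Bug type is treated as unknown.

```shell
#!/bin/sh
# Hypothetical lookup (not part of T1): map a Pokemon type to DLEN in bits,
# following the table above and assuming 1K = 1024 bits.
dlen_for_type() {
  case "$1" in
    Grass)           echo 32 ;;
    Fire)            echo 64 ;;
    Flying)          echo 128 ;;
    Water)           echo 256 ;;
    Fighting)        echo 512 ;;
    Electric|Ground) echo 1024 ;;
    Psychic)         echo 2048 ;;
    Dark)            echo 4096 ;;
    Ice)             echo 8192 ;;
    Fairy)           echo 16384 ;;
    Ghost)           echo 32768 ;;
    Dragon)          echo 65536 ;;
    *) echo "unknown type: $1" >&2; return 1 ;;
  esac
}

# Example: look up the DLEN of a Water-type configuration.
dlen_for_type Water
```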
You can build its components with the following commands:
$ nix build .#t1.elaborator # the wrapped jar file of the Chisel elaborator
# Build T1
$ nix build .#t1.<config-name>.t1.rtl # the elaborated IP core .sv files
# Build T1 Emu
$ nix build .#t1.<config-name>.t1emu.rtl # the elaborated IP core .sv files
$ nix build .#t1.<config-name>.t1emu.verilator-emu # build the IP core emulator using verilator
$ nix build .#t1.<config-name>.t1emu.vcs-emu --impure # build the IP core emulator using VCS w/ VCS environment locally
$ nix build .#t1.<config-name>.t1emu.vcs-emu-trace --impure # build the IP core emulator using VCS w/ trace support
# Build T1 Rocket emulator
$ nix build .#t1.<config-name>.t1rocketemu.rtl # the elaborated T1 with Rocket core .sv files
$ nix build .#t1.<config-name>.t1rocketemu.verilator-emu # build the t1rocket emulator using verilator
$ nix build .#t1.<config-name>.t1rocketemu.vcs-emu # build the t1rocket emulator using VCS
$ nix build .#t1.<config-name>.t1rocketemu.vcs-emu-trace # build the t1rocket emulator using VCS with trace support
where <config-name> should be replaced with a configuration name, e.g. blastoise. The build output will be put in the ./result directory by default.
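The attribute paths in the commands above follow one pattern, which the sketch below composes without invoking nix. The helper name t1_attr is hypothetical and exists only for illustration; it prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of T1): compose the flake attribute paths
# used by the nix build commands above, without invoking nix.
t1_attr() {
  config=$1; top=$2; target=$3
  echo ".#t1.$config.$top.$target"
}

# Print the build commands for two t1emu targets of one configuration.
for target in rtl verilator-emu; do
  echo "nix build $(t1_attr blastoise t1emu "$target")"
done
```

Piping the printed lines to sh would run the builds, provided the configuration and target exist in the flake.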
Configurations currently under test:
| Config name | Short summary |
|---|---|
| Blastoise | DLEN256 VLEN512; FP; VRF p0rw,p1rw bank1; LSU bank8 beatbyte 8 |
| Machamp | DLEN512 VLEN1K ; NOFP; VRF p0r,p1w bank2; LSU bank8 beatbyte 16 |
| Sandslash | DLEN1K VLEN4K ; NOFP; VRF p0rw bank4; LSU bank16 beatbyte 16 |
| Alakazam | DLEN2K VLEN16K; NOFP; VRF p0rw bank8; LSU bank8 beatbyte 64 |
| t1rocket | Configuration specific to t1rocket |
The <config-name> can also be t1rocket,
a special configuration name that enables rocket-chip support for scalar instructions.
To see all possible combinations of <config-name> and <top-name>, use:
make list-configs
To run a test case on the IP emulator, use the following script:
$ nix develop -c t1-helper run -i <top-name> -c <config-name> -e <emulator-type> <case-name>
where
- <config-name> is the configuration name
- <top-name> is one of t1emu, t1rocketemu
- <emulator-type> is one of verilator-emu, verilator-emu-trace, vcs-emu, vcs-emu-trace, vcs-emu-cover
- <case-name> is the name of a test case; you can list runnable test cases with the command: t1-helper listCases -c <config-name> <regexp>
For example:
$ nix develop -c t1-helper run -i t1emu -c blastoise -e vcs-emu intrinsic.linear_normalization
To get a waveform, use the trace emulator:
$ nix develop -c t1-helper run -i t1emu -c blastoise -e vcs-emu-trace intrinsic.linear_normalization
The <config-name>, <top-name> and <emulator-type> options are cached under $XDG_CONFIG_HOME,
so if you want to run multiple test cases with the same emulator,
you don't need to add the -c, -i and -e options every time.
For example:
$ nix develop -c t1-helper run -i t1emu -c blastoise -e vcs-emu-trace intrinsic.linear_normalization
$ nix develop -c t1-helper run pytorch.llama
To get verbose logging, add the -v option:
$ nix develop -c t1-helper run -v pytorch.lenet
The t1-helper run subcommand only runs the driver without validating internal status.
To run design verification, use the t1-helper check subcommand:
$ nix develop -c t1-helper run -i t1emu -c blastoise -e vcs-emu mlir.hello
$ nix develop -c t1-helper check
The t1-helper check subcommand reads the RTL events produced in the run stage,
so make sure you run a test before check.
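Because check depends on a preceding run, the two steps can be chained so that check only executes when run succeeds. The wrapper below is a sketch under stated assumptions, not part of t1-helper: the names run_and_check and T1_HELPER are hypothetical, and the emulator types it accepts are the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of T1): run a test case, then check it.
# T1_HELPER can be overridden (for example with "echo") for a dry run.
T1_HELPER="${T1_HELPER:-nix develop -c t1-helper}"

run_and_check() {
  top=$1; config=$2; emu=$3; case_name=$4
  # Reject emulator types that are not in the documented list.
  case "$emu" in
    verilator-emu|verilator-emu-trace|vcs-emu|vcs-emu-trace|vcs-emu-cover) ;;
    *) echo "unknown emulator type: $emu" >&2; return 1 ;;
  esac
  # $T1_HELPER is intentionally unquoted so it splits into command words.
  $T1_HELPER run -i "$top" -c "$config" -e "$emu" "$case_name" &&
    $T1_HELPER check
}

# Usage: run_and_check t1emu blastoise vcs-emu mlir.hello
```

Setting T1_HELPER=echo before calling the function prints the two commands instead of executing them, which is a cheap way to inspect what would run.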
To get the coverage report, use the vcs-emu-cover emulator type:
$ nix develop -c t1-helper run -i t1emu -c blastoise -e vcs-emu-cover mlir.hello
$ nix run .#t1.<config-name>.<top-name>.omreader <key> # export the contents of the specified key
$ nix run .#t1.<config-name>.<top-name>.emu-omreader <key> # export the contents of the specified key with emulation support
To dump all available keys and preview their contents:
$ nix run .#t1.<config-name>.<top-name>.omreader -- run --dump-methods
$ nix run .#t1.<config-name>.<top-name>.emu-omreader -- run --dump-methods
Schema
| Field | Type |
|---|---|
| [*] | string |

| Field | Type |
|---|---|
| [*] | array |
| [*].attributes | object |
| [*].attributes[*] | array |
| [*].attributes[*].description | string |
| [*].attributes[*].identifier | string |
| [*].attributes[*].value | string |
$ nix develop .#t1.elaborator # bring up scala environment, circt tools, and create submodules
$ nix develop .#t1.elaborator.editable # or if you want submodules editable
$ mill -i elaborator # build and run elaborator
$ nix develop .#t1.<config-name>.<top-name>.vcs-dpi-lib # replace <config-name> with your configuration name
$ cd difftest
$ cargo build --features vcs
The tests/ directory contains all the test cases.
- asm
- codegen
- intrinsic
- mlir
- perf
- pytorch
- rvv_bench
To view what is available to run, use the t1-helper listCases subcommand:
$ nix develop -c t1-helper listCases -c <config-name> -i <top-name> <regexp>
For example,
$ t1-helper listCases -c blastoise -i t1emu mlir
[INFO] Fetching current test cases
* mlir.axpy_masked
* mlir.conv
* mlir.hello
* mlir.matmul
* mlir.maxvl_tail_setvl_front
* mlir.rvv_vp_intrinsic_add
* mlir.rvv_vp_intrinsic_add_scalable
* mlir.stripmining
* mlir.vectoradd
$ t1-helper listCases -c blastoise -i t1emu '.*vslid.*'
[INFO] Fetching current test cases
* codegen.vslide1down_vx
* codegen.vslide1up_vx
* codegen.vslidedown_vi
* codegen.vslidedown_vx
* codegen.vslideup_vi
* codegen.vslideup_vx
To develop a specific test case, enter the development shell:
# nix develop .#t1.<config-name>.<top-name>.cases.<type>.<name>
#
# For example:
$ nix develop .#t1.blastoise.t1emu.cases.pytorch.llama
Build tests:
# build a single test
$ nix build .#t1.<config-name>.<top-name>.cases.intrinsic.matmul -L
$ ls -al ./result
To develop coverage, use the following steps:
- Write the coverpoint description file at the same level as the test case source code.
- Update the default.nix file to parse the coverpoint description file.
For example, to develop coverage for the mlir.hello test case:
tests/mlir/hello/hello.json:
{
"assert": [
{
"name": "vmv_v_i",
"description": "single instruction vmv.v.i"
}
],
"tree": [],
"module": []
}
tests/mlir/default.nix:
if [ -f ${caseName}.json ]; then
${jq}/bin/jq -r '[.assert[] | "+assert " + .name] + [.tree[] | "+tree " + .name] + [.module[] | "+module " + .name] | .[]' \
${caseName}.json > $pname.cover
else
echo "-assert *" > $pname.cover
fi
Then, you can run the test building script to check if the coverage is generated correctly:
nix build .#t1.blastoise.t1emu.cases.mlir.hello -L
Use the vcs-emu-cover emulator type to run the test case and generate the coverage report:
nix develop -c t1-helper run -i t1emu -c blastoise -e vcs-emu-cover mlir.hello
Bump nixpkgs:
$ nix flake update
Bump chisel submodule versions:
$ cd nix/t1/dependencies
$ nix run '.#nvfetcher'
Copyright © 2022-2023, Jiuyang Liu. Released under the Apache-2.0 License.