RISC-V V Vector Extension (RVV) with reduced number of vector registers
arXiv:2410.08396v1 [cs.AR] 10 Oct 2024

Eino Jacobs∗1, Dmitry Utyansky†1, Muhammad Hassan‡2, and Thomas Roecker§2
1 Synopsys, Inc, Sunnyvale, USA
2 Infineon Technologies AG, Munich, Germany

October 14, 2024

Abstract

To reduce the area of the RISC-V Vector extension (RVV) in small processors, the
authors are considering one simple modification: reduce the number of registers in the
vector register file. The standard “V” extension requires 32 vector registers that we propose
to reduce to 16 or 8 registers. Other features of RVV are still supported.
Reducing the number of vector registers does not generate a completely new programming
model: although the resulting core does not have binary code compatibility with the
standard RVV, compiling for it just requires parameterization of the vector register file
size in the compiler.
The reduced vector register file still allows for high utilization of the RVV vector
processor core. Many useful signal processing kernels require few registers.

1 Introduction
The RISC-V Vector Extension (RVV), ratified in version 1.0 [4], provides an Instruction
Set Architecture (ISA) that has scalability and flexibility with respect to the underlying
processor implementation. This allows optimization of processor designs for different
area/power/performance targets that all share a consistent programming model. One of the
advantages of this approach is cost reduction through reuse of hardware, compilers, libraries
and software.
∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]

However, for low-end processors the mandatory vector register file with 32 vector registers
contributes visibly to the overall core size. In this paper we assume that the smallest practical
vector length (VLEN) is 64 bits and therefore the smallest practical vector register file consists
of 32 registers of 64 bits for a total of 2048 bits. We also assume that the number of physical
vector registers equals the number of architectural vector registers. The RVV standard also
allows a VLEN of 32 bits, but we do not consider this practical,
because ecosystem support is currently weak for this small size and 64-bit data types are not
supported. We argue that an option to reduce the vector register file down to 16 or even 8
registers is a good choice for some designs. A vector register file with 8 registers of 64 bits has
512 register bits in total. This is the same number of bits as in the scalar register file of the
RV32E version of RISC-V, which has 16 registers of 32 bits.
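The register file sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function and variable names are illustrative and do not come from the paper.

```python
def register_file_bits(num_registers: int, bits_per_register: int) -> int:
    """Total storage of a register file, in bits."""
    return num_registers * bits_per_register


# Vector register files with VLEN = 64, the smallest size the text treats as practical.
rvv_full = register_file_bits(32, 64)      # standard RVV 1.0
rvv_16 = register_file_bits(16, 64)        # proposed reduction to 16 registers
rvv_8 = register_file_bits(8, 64)          # proposed reduction to 8 registers

# Scalar register file of RV32E: 16 registers of 32 bits.
rv32e_scalar = register_file_bits(16, 32)

print(rvv_full, rvv_16, rvv_8, rv32e_scalar)
```

With 8 vector registers the vector register file holds as many bits as the RV32E scalar register file, which is the comparison made in the text.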

2 Number of Vector Registers, their Sizes and Typical Application Requirements
The number of vector registers and the length of vector registers must be considered together
for performance analysis. Depending on the processor data path length (DLEN), available
functional units and other details, the dimensions of the vector register file that are required to
obtain good cycle performance and good hardware utilization will vary. This also depends on
the compute kernels (complexity of computations, the ability to reuse data loaded into registers,
etc.) and therefore on the application domain. In this paper we focus on embedded
applications and processor configurations minimizing processor size and power consumption.
At this area/performance point the processor has a gate count of a few dozen kgates, a datapath
length of 32 bits (DLEN = 32), and either in-order single-issue or limited fusion/multi-issue
capabilities.

One attractive feature of RVV is potential "quasi-multi-issue" through chaining. If the
vector length is longer than DLEN, a vector instruction issued in one cycle is executed for
multiple cycles (VLEN/DLEN > 1 cycle). The extra cycles can be used to issue other
instructions. These can be executed in parallel with the first vector instruction (in the same
convoy, in vector processor terminology), as long as they use other execution resources. One
can think of the chaining ratio as (EMUL · VLEN)/DLEN, where EMUL, or effective LMUL
(vector length multiplier), is defined by RVV 1.0 [4] as the number of registers required to hold
vector operand elements. So, with e.g. a chaining ratio of 1:4, one can see the code as packages
(convoys) of 4 multicycle instructions using different execution resources.
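To make the chaining ratio concrete, the sketch below evaluates (EMUL · VLEN)/DLEN for the configuration used later in the paper (VLEN = 64, DLEN = 32). The function name is an assumption made for illustration; a fraction type is used because EMUL may itself be fractional in RVV.

```python
from fractions import Fraction


def chaining_ratio(emul, vlen_bits: int, dlen_bits: int) -> Fraction:
    """Cycles a vector instruction occupies its unit: (EMUL * VLEN) / DLEN."""
    return Fraction(emul) * vlen_bits / dlen_bits


# Reference configuration from the text: VLEN = 64 bits, DLEN = 32 bits.
for emul in (1, 2, 4, 8):
    print(f"EMUL={emul}: 1:{chaining_ratio(emul, 64, 32)}")
```

Under these assumptions EMUL = 2 gives the 1:4 ratio mentioned above, i.e. up to 4 instructions can be issued while one vector instruction is still executing.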

A few considerations need to be taken into account to dimension the vector register file.

• Cycle performance and the program’s ability to fully load the processor’s functional units.
There is a sweet spot for the chaining ratio, beyond which the execution resources are
fully saturated, therefore for a given DLEN there is no benefit in increasing VLEN beyond
that.
• Processor size. For small processors, the vector register file is a substantial contributor
to the area, so obviously the smaller the better.
• Applications. There is a limit to vector length that the application can use. For typical
embedded applications this is naturally not too big.

We observe that in small processor designs the full RVV 1.0 vector register file (32 vector
registers of 64 bits) can take about 30% of the overall logic area. In this case, reducing
the vector register file to half (16 vector registers of 64 bits) or to a quarter (8 vector
registers of 64 bits) can save about 15% or 23% of the processor area, respectively.
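The savings follow from simple proportional reasoning. The sketch below assumes, as a simplification not stated explicitly in the paper, that vector register file area scales linearly with the number of registers; the function name is illustrative.

```python
def area_saving(vrf_area_share: float, kept_fraction: float) -> float:
    """Fraction of total core area saved when only kept_fraction of the
    vector register file remains (assumes area is linear in register count)."""
    return vrf_area_share * (1.0 - kept_fraction)


VRF_SHARE = 0.30  # about 30% of logic area, as observed in the text

half = area_saving(VRF_SHARE, 16 / 32)     # 32 -> 16 registers
quarter = area_saving(VRF_SHARE, 8 / 32)   # 32 -> 8 registers

print(f"16 registers: {half:.1%} saved, 8 registers: {quarter:.1%} saved")
```

Under this linear model the savings are about 15% and 22.5%, which matches the figures in the text after rounding.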

For application examples, consider the automotive industry. One such application is the
filtering of ADC data, necessitating the use of algorithms such as Finite Impulse Response (FIR)
or averaging filters. Another application is the enhancement of sensor data accuracy through
virtual sensors or filtering of sensor data. This typically involves system state modelling with
Kalman Filters or Embedded AI Neural Network (NN) topologies like Multilayer Perceptrons
(MLP), autoencoders and Recurrent Neural Networks (RNNs). Additionally, audio sensor data
processing, such as key word recognition and environment sensing using audio, is another area
for which embedded processors are utilized. These algorithms typically rely heavily on matrix
and vector computations. Efficient execution of these algorithms on embedded processors is
crucial for ensuring the reliability and accuracy of the automotive systems that rely on them.

Compute kernels used in linear algebra and Digital Signal Processing (DSP) usually do not
require huge amounts of data in flight. E.g., a simple "load-load-multiply-accumulate" pattern
can be implemented with just the bare minimum of 2 vector registers for input and 1 vector
register for output.

RVV-like vector architectures make it possible to avoid explicit software pipelining and loop
unrolling with multiple loop iterations in flight, which would otherwise require multiple copies
of data in registers and therefore more registers. Essentially, hardware takes care of the
required pipelining, enabling parallelism of e.g. loads and ALU operations.

Many well-known vector processors used fairly few vector registers with good results. E.g.,
all Cray computers had 8 vector registers ([2], [3]), though each was longer. Closer
to modern days, the ARM M-profile Vector Extension (MVE) uses 8 128-bit vector registers [1].
RVV's 32 vector registers, together with the ability to combine vectors into groups using the
LMUL factor, allow the processor to be configured for fewer, longer vector registers: with
LMUL=8 one effectively has 4 vector registers that are 8x longer.
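The effect of LMUL grouping on the number of usable register groups can be sketched as follows. The assumption that a reduced register file with 16 or 8 registers follows the same grouping rule as standard RVV is ours, made for illustration; the function name is hypothetical.

```python
def register_groups(num_registers: int, lmul: int) -> int:
    """Number of register groups available when registers are combined
    LMUL at a time (integer LMUL only in this sketch)."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("integer LMUL must be 1, 2, 4 or 8")
    if num_registers % lmul != 0:
        raise ValueError("register count must be a multiple of LMUL")
    return num_registers // lmul


for regs in (32, 16, 8):
    groups = {lmul: register_groups(regs, lmul) for lmul in (1, 2, 4, 8)}
    print(f"{regs} registers: {groups}")
```

For the standard 32 registers and LMUL=8 this gives the 4 groups mentioned above. With 8 registers, LMUL=8 would leave a single group, so large LMUL values become less useful as the register file shrinks.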

3 Processor Configuration for Embedded DSP

In the subsequent sections we focus on fixed point kernels for small embedded DSP use-cases.
We assume a small processor without floating point; inclusion of a relatively large floating point
unit would make the area savings of a reduction of the number of vector registers seem less
important.

Other features we assume:

• VLEN = 64 bits, floating point deconfigured (i.e., Zve64x extension). 64 bits as minimum
also makes sense because 64-bit integers are useful as accumulators.
• DLEN = 32 bits in the smallest configuration. This can be scaled up for bigger processors,
with the matching vector length increase.
• Processor implementation supporting chaining of operations (at least between vector load
and ALU and Multiply-accumulate unit).
• No other multi-issue capabilities. Chaining is the only way to organize "quasi-multi-issue".
• Load/store bandwidth of 32 bits. A higher bandwidth of 64 bits (DLEN · 2) would relax
the load bottleneck for some kernels. This is a topic to explore further.
• Reasonably short latency of load, MAC and ALU operations. Typically such designs
use relatively low clock frequency, hence shallow processor pipeline. For the examples
considered we use 1-cycle latency for loads, 5 cycles for multiply-accumulate, 3 cycles for
other ALU operations. For the kernels considered in this paper a small variation of the
latency does not change the results.

4 DSP Kernel Analysis with Reduced Vector Registers

In this section we analyze several digital signal processing kernels that are useful for many
applications. To quantify the quality of mapping of the kernel to the processor we use utilization
of the critical resource for each kernel. Typically this is either ALU, multiplier, or load/store
bandwidth.

One important consideration is "run-time VLEN agnosticism". Depending on the use
case, this might be more or less important. For deeply embedded applications it often does not
matter: the program is compiled for the specific processor, and knowledge about its parameters,
like hardware vector length, can be used by the compiler to better optimize the code.
Specifically, this allows "vset*" instructions to be dropped from tight loops, minimizing the
number of instructions. Also, special tail handling can often be avoided by choosing convenient
sizes or by handling the tail explicitly outside of the main loop. In the subsequent analysis we
assume that a program is compiled for the specific configuration, so vset* instructions can be
dropped from the inner loops.

4.1 Matrix Multiplication

Each table in this and the subsequent sections shows the inner kernel of the processing
loop in sustained operation, with columns showing the instructions issued at each cycle, the
functional unit used by the instruction, and the results written to the vector register file by the
selected functional units at that cycle. A colon-separated number after a vector register name
is the sequential number of the DLEN-sized group of bits output at that cycle. For a processor
with VLEN = 64 and DLEN = 32, each vector register is treated as two such groups, :0 and :1. So,
e.g., the writes of v0:0,v0:1 that result from a vwmacc v0 operation appear 5 cycles after it is
issued, while the preceding writes by the MAC correspond to the previous loop iteration. vwmacc
is a widening multiply-accumulate operation, so it generates a 2x wider result: two DLEN-sized
chunks.
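The way the MAC column of such a table is filled in can be illustrated with a small model. This is a sketch under simplifying assumptions of our own: the MAC writes one full destination register (chunks :0 and :1) per cycle, the first write happens a fixed latency after issue, and cycle numbers wrap around the loop in steady state. The function name and argument layout are hypothetical.

```python
def mac_write_schedule(issues, loop_cycles, mac_latency=5):
    """Map each loop cycle to the vector register the MAC writes in that cycle.

    issues: list of (issue_cycle, first_dest_register, dest_register_count).
    Assumes one destination register is written per cycle, starting
    mac_latency cycles after issue and wrapping around the loop.
    """
    schedule = {}
    for issue_cycle, first_reg, reg_count in issues:
        for i in range(reg_count):
            cycle = (issue_cycle + mac_latency + i) % loop_cycles
            if cycle in schedule:
                raise ValueError(f"MAC write conflict in cycle {cycle}")
            schedule[cycle] = f"v{first_reg + i}"
    return schedule


# Table 1: vwmacc.vx v0 issued in cycle 3 and vwmacc.vx v4 in cycle 7 of a
# 9-cycle loop. With LMUL=2, each widening MAC writes 4 destination registers.
table1 = mac_write_schedule([(3, 0, 4), (7, 4, 4)], loop_cycles=9)
for cycle in range(9):
    print(cycle, table1.get(cycle, "."))
```

With these inputs the model places v0 in cycle 8, v1 to v7 in cycles 0 to 6, and leaves cycle 7 idle, which is the MAC column of Table 1.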

For the reference processor configuration, starting with the tile size of 2 × (VLEN · 2) (e.g.
2 × 16 samples for 8-bit samples and VLEN = 64) we can get to 8/9 = 89% utilization of
the multiply-accumulate unit. For this we need a 1:4 chaining ratio, achievable with LMUL=2,
as shown in Table 1. The loop takes 9 cycles in total (assuming single issue, one instruction per
cycle, no stalls), of which the MAC unit is active for 8 cycles, producing a double-wide result
in each cycle.

For some cases the utilization can be 100%. If one of the scalar loads can use immediate
offset, both loads can reuse the same base address and so only one address increment is required,
reducing number of instructions to 8, as shown in Table 2.

If 2 × 16 tiles are too wide for the application, inevitably we have fewer data processed with
the same code and same cycles, so e.g. for 2 × 8 tiles we will have LMUL=1 and utilization of
4/9=44%, as shown in Table 3. Or 50% if just one address update addi is needed, i.e., when
overall matrix row size fits into a 12-bit immediate offset allowed in the lb instruction.

For comparison, increasing the tile width to fit LMUL=4 allows us to obtain 100% utilization,
with enough "issue cycles" to spare that they can potentially be used for additional
computations while maintaining the same MAC utilization, as shown in Table 4.
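The utilization figures for the matrix multiplication variants can be reproduced with a simple model. It is a sketch under assumptions of our own: each of the two widening MACs per iteration keeps the MAC unit busy for 2 · LMUL cycles, and the loop length is the larger of the instruction count (single issue) and the MAC busy time. Load bandwidth and stalls are ignored, and the function names are illustrative.

```python
from fractions import Fraction


def mac_busy_cycles(lmul: int, macs_per_iteration: int = 2) -> int:
    """Cycles the MAC unit is busy per loop iteration (one double-wide
    destination register written per cycle, 2 * LMUL registers per MAC)."""
    return macs_per_iteration * 2 * lmul


def mac_utilization(lmul: int, instructions_per_iteration: int) -> Fraction:
    """MAC utilization when the loop is bounded by issue slots or MAC time."""
    busy = mac_busy_cycles(lmul)
    loop_cycles = max(instructions_per_iteration, busy)
    return Fraction(busy, loop_cycles)


cases = {
    "LMUL=2, two pointer updates (Table 1)": (2, 9),
    "LMUL=2, one pointer update (Table 2)": (2, 8),
    "LMUL=1, two pointer updates (Table 3)": (1, 9),
    "LMUL=1, one pointer update": (1, 8),
    "LMUL=4 (Table 4)": (4, 9),
}
for name, (lmul, instructions) in cases.items():
    u = mac_utilization(lmul, instructions)
    print(f"{name}: {u} = {float(u):.0%}")
```

The model gives 8/9, 1, 4/9, 1/2 and 1, in line with the figures quoted in the text. For LMUL=4 the loop is MAC-bound (16 cycles for 9 instructions), which is where the spare issue cycles come from.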

Cycle Instruction issued Unit VMAC:M VLOAD:L

0 vle8.v v8,(x28) VLOAD M:v1:0,v1:1 L:v8:0
1 c.addi x28,0x10 Scalar M:v2:0,v2:1 L:v8:1
2 lb x7,(x13) Scalar M:v3:0,v3:1 L:v9:0
3 vwmacc.vx v0,x7,v8 VMAC M:v4:0,v4:1 L:v9:1
4 c.addi x13,0x1 Scalar M:v5:0,v5:1 .
5 lb x8,(x5) Scalar M:v6:0,v6:1 .
6 c.addi x5,0x1 Scalar M:v7:0,v7:1 .
7 vwmacc.vx v4,x8,v8 VMAC . .
8 bne x13,x15,0xffffffe6 Scalar M:v0:0,v0:1 .

Table 1: Matrix multiplication kernel, 2 × (VLEN · 2) tile, 8 ∗ 8 → 16 bits

10 vector registers used

Cycle Instruction issued Unit VMAC:M VLOAD:L

0 vle8.v v8,(x28) VLOAD M:v1:0,v1:1 L:v8:0
1 c.addi x28,0x10 Scalar M:v2:0,v2:1 L:v8:1
2 lb x7,N(x5) Scalar M:v3:0,v3:1 L:v9:0
3 vwmacc.vx v0,x7,v8 VMAC M:v4:0,v4:1 L:v9:1
4 lb x8,(x5) Scalar M:v5:0,v5:1 .
5 c.addi x5,0x1 Scalar M:v6:0,v6:1 .
6 vwmacc.vx v4,x8,v8 VMAC M:v7:0,v7:1 .
7 bne x13,x15,0xffffffe6 Scalar M:v0:0,v0:1 .

Table 2: Matrix multiplication kernel, 2 × (VLEN · 2) tile, 8 ∗ 8 → 16 bits

Single x-pointer is used to access samples from column, separated by N bytes
10 vector registers used

Cycle Instruction issued Unit VMAC:M VLOAD:L
0 vle8.v v8,(x28) VLOAD M:v1:0,v1:1 L:v8:0
1 c.addi x28,0x10 Scalar M:v2:0,v2:1 L:v8:1
2 lb x7,(x13) Scalar M:v3:0,v3:1 .
3 vwmacc.vx v0,x7,v8 VMAC . .
4 c.addi x13,0x1 Scalar . .
5 lb x8,(x5) Scalar . .
6 c.addi x5,0x1 Scalar . .
7 vwmacc.vx v4,x8,v8 VMAC . .
8 bne x13,x15,0xffffffe6 Scalar M:v0:0,v0:1 .

Table 3: Matrix multiplication kernel, 2 × VLEN tile, 8 ∗ 8 → 16 bits

4 vector registers used

Cycle Instruction issued Unit VMAC:M VLOAD:L

0 vle8.v v16,(x28) VLOAD M:v5:0,v5:1 L:v16:0
1 addi x28,x28,0x20 Scalar M:v6:0,v6:1 L:v16:1
2 lh x7,(x13) Scalar M:v7:0,v7:1 L:v17:0
3 - - - M:v8:0,v8:1 L:v17:1
4 - - - M:v9:0,v9:1 L:v18:0
5 - - - M:v10:0,v10:1 L:v18:1
6 vwmacc.vx v0,x7,v16 VMAC M:v11:0,v11:1 L:v19:0
7 c.addi x13,0x2 Scalar M:v12:0,v12:1 L:v19:1
8 lh x8,(x5) Scalar M:v13:0,v13:1 .
9 c.addi x5,0x2 Scalar M:v14:0,v14:1 .
10 - - - M:v15:0,v15:1 .
11 - - - M:v0:0,v0:1 .
12 - - - M:v1:0,v1:1 .
13 - - - M:v2:0,v2:1 .
14 vwmacc.vx v8,x8,v16 VMAC M:v3:0,v3:1 .
15 bne x13,x15,0xffffffe4 Scalar M:v4:0,v4:1 .

Table 4: Matrix multiplication kernel, 2 × (VLEN · 4) tile, 8 ∗ 8 → 16 bits

20 vector registers used

Summarizing, while the 32 vector registers of RVV 1.0 provide ample margin, 10 vector registers
are almost as good for this kernel. Also note that the 2 × (VLEN · 4) tile size might be
impractical for some applications: in small embedded applications one is often dealing with
smaller matrices, and for e.g. 8-bit data a 2 × (VLEN · 4) tile (2 × 32 elements) might be too wide.

4.2 Vector Accumulation

Accumulation appears in many applications, for instance if one needs to compute the sum
or average of an array of data samples. The standard approach is to accumulate data in a
vector and then do a reduction sum of the vector. For sizeable vectors most cycles are spent
in the accumulation loop, so we focus on it.

Assuming the processor is capable of writing double-wide results to a vector register each
cycle, we have the kernel shown in Table 5, with 100% ALU and load bandwidth utilization.

Cycle Instruction issued Unit ALU:A VLOAD:L

0 vl4re16.v v0,(a0) VLOAD A:v2:0,v2:1 L:v0:0
1 vwadd.wv v2,v2,v0 ALU A:v3:0,v3:1 L:v0:1
2 addi a0,a0,16 Scalar A:v4:0,v4:1 L:v1:0
3 bne a5,a0,pc - 16 Scalar A:v5:0,v5:1 L:v1:1

Table 5: Vector accumulation, VLEN · 2 tile (LMUL=2), 16 ∗ 16 → 32 bits

6 vector registers used

4.3 Dot Product

Dot product is a building block of many linear algebra and DSP kernels. In the vector*vector
form it is load-bound, requiring two vector loads per vector MAC, so if vector loads are
only DLEN-wide we can at best have 50% MAC utilization. LMUL=2 is enough to get to full
utilization of the load unit, as shown in Table 6.

Cycle Instruction issued Unit MAC:M VLOAD:L

0 vle16.v v0,(x10) VLOAD . L:v0:0
1 c.addi x10,0x10 Scalar . L:v0:1
2 - - - . L:v1:0
3 - - - M:v4:0,v4:1 L:v1:1
4 vle16.v v2,(x11) VLOAD M:v5:0,v5:1 L:v2:0
5 c.addi x11,0x10 Scalar M:v6:0,v6:1 L:v2:1
6 vwmacc.vv v4,v0,v2 VMAC M:v7:0,v7:1 L:v3:0
7 bne x15,x10,0xfffffff0 Scalar . L:v3:1

Table 6: Dot product, VLEN · 2 tile (LMUL=2), 16 ∗ 16 → 32 bits

8 vector registers used
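The 50% bound for the dot product can be derived from the load and MAC cycle counts. The sketch below is a simplified model of our own: two operand vectors of LMUL registers each are loaded per iteration at a fixed load bandwidth, the MAC writes one double-wide (2 · DLEN) result per cycle, and issue slots and latencies are ignored. The 64-bit load bandwidth case is hypothetical and corresponds to the DLEN · 2 option mentioned in Section 3.

```python
from fractions import Fraction


def dot_product_mac_utilization(lmul, vlen_bits=64, dlen_bits=32,
                                load_bits_per_cycle=32):
    """Upper bound on MAC utilization for a vector*vector dot product."""
    # Two input vectors, each LMUL registers of VLEN bits.
    load_cycles = Fraction(2 * lmul * vlen_bits, load_bits_per_cycle)
    # Widened result is 2 * LMUL * VLEN bits, written 2 * DLEN bits per cycle.
    mac_cycles = Fraction(2 * lmul * vlen_bits, 2 * dlen_bits)
    loop_cycles = max(load_cycles, mac_cycles)
    return mac_cycles / loop_cycles


print(dot_product_mac_utilization(lmul=2))                          # DLEN-wide loads
print(dot_product_mac_utilization(lmul=2, load_bits_per_cycle=64))  # hypothetical 64-bit loads
```

For LMUL=2 the model gives 8 load cycles against 4 MAC cycles, i.e. the 4-of-8 active MAC cycles shown in Table 6. Doubling the load bandwidth would remove the bottleneck in this model.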

4.4 Matrix by Vector Multiplication

Matrix by vector multiplication is a common kernel, used for AI/inference (e.g., dense layers,
perceptron-like) and DSP (e.g., Finite Impulse Response filter). For the processor considered
this is both a load- and MAC-bound kernel, and best utilization is with LMUL=4, with 4
vector registers used for input and 8 vector registers used for output, as shown in Table 7:
the loop takes 9 cycles, of which the MAC is active in 8 cycles, so utilization is 8/9 = 89%.

Cycle Instruction issued Unit MAC:M LOAD:L
0 lbu a2,0(a5) Scalar M:v10:0,v10:1 a2
1 vle8.v v0,(a4) VLOAD M:v9:0,v9:1 L:v0:0
2 c.addi a5,1 Scalar M:v11:0,v11:1 L:v0:1
3 addi a4,a4,128 Scalar M:v4:0,v4:1 L:v1:0
4 - - - M:v5:0,v5:1 L:v1:1
5 - - - M:v6:0,v6:1 L:v2:0
6 vwmacc.vx v4,a2,v0 VMAC M:v7:0,v7:1 L:v2:1
7 bne s0,a5,pc - 18 Scalar M:v8:0,v8:1 L:v3:0
8 - - - . L:v3:1

Table 7: Matrix by vector multiplication, VLEN · 4 tile (LMUL=4), 8 ∗ 8 → 16 bits

12 vector registers used

4.5 Results Summary

Table 8 summarizes kernels for a few tiling sizes, LMUL factor used, required vector register
number and resulting MAC utilization.

An increase of the number of vector registers beyond the indicated numbers does not improve
MAC utilization further. All considered microkernels are either MAC-bound or load/store
bandwidth bound.

Kernel         Data Size        LMUL  Vector registers used  MAC utilization, %  Comment
Matrix*Matrix  2 × (VLEN · 2)   2     10                     89 to 100           Depends on size (+addi)
Matrix*Matrix  2 × VLEN         1     4                      44 to 50            Tile too small for efficiency
Accumulation   VLEN · 2         2     6                      100
Dot product    VLEN · 2         2     8                      50                  Load-limited
Matrix*Vector  VLEN · 4         4     12                     89

Table 8: Results summary for kernels, number of vector registers used and MAC utilization

5 Conclusions
Our analysis shows that important kernels can be efficiently implemented with fewer than 32
vector registers, the number required by the RVV 1.0 standard. This applies to vector lengths
as small as 64 bits. All kernels fit into 16 vector registers. Some fit into 8 vector registers and
all fit into 8 registers with lower utilization.
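The fit of each kernel into a reduced register file can be read directly from Table 8. The sketch below encodes the register counts from that table and checks them against 16 and 8 registers; the data structure and function name are illustrative.

```python
# (kernel configuration, vector registers used), taken from Table 8.
TABLE8_REGISTERS = [
    ("Matrix*Matrix, 2 x (VLEN*2), LMUL=2", 10),
    ("Matrix*Matrix, 2 x VLEN, LMUL=1", 4),
    ("Accumulation, VLEN*2, LMUL=2", 6),
    ("Dot product, VLEN*2, LMUL=2", 8),
    ("Matrix*Vector, VLEN*4, LMUL=4", 12),
]


def kernels_that_fit(available_registers: int):
    """Kernel configurations whose register use fits the given register file."""
    return [name for name, used in TABLE8_REGISTERS if used <= available_registers]


for size in (16, 8):
    fitting = kernels_that_fit(size)
    print(f"{size} registers: {len(fitting)} of {len(TABLE8_REGISTERS)} configurations fit")
    for name in fitting:
        print("  ", name)
```

All five configurations fit into 16 registers and three fit into 8. The two that do not fit into 8 registers at their best tiling (10 and 12 registers) would need a smaller LMUL, which is the lower-utilization case mentioned above.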

Configurations with a reduced number of vector registers enable smaller processor designs.
This approach does not require a completely new programming model. Although the resulting
processor does not have binary code compatibility with standard RVV, compiling for it only
requires parameterization of the vector register file size in the compiler. Overall optimization
approaches are still the same, which allows a unified source code base. This appears to be
a good choice for deeply embedded designs, for which unconstrained binary code compatibility
is not a hard requirement.

Our result suggests an ISA extension to reduce the number of vector registers to 16 or 8 as
part of the RVV standard.

References
[1] Arm Limited. Arm® Cortex®-M55 Processor Technical Reference Manual, Revision: r1p1.
https://developer.arm.com/documentation/101051/0101/?lang=en, 2024. [Online; accessed 6-Aug-2024].
[2] Cray Research, Inc. Cray-1 Computer System Hardware Reference Manual 2240004, Rev C.
http://bitsavers.trailing-edge.com/pdf/cray/CRAY-1/2240004C_CRAY-1_Hardware_Reference_Nov77.pdf,
1977. [Online; accessed 6-Aug-2024].
[3] Cray Research, Inc. CRAY-Y MP EL Functional Description, HR-04027.
https://cray-history.net/wp-content/uploads/2021/08/HR-04027-CRAY-Y-MP-EL-Functional-Descriptio
1992. [Online; accessed 30-July-2024].
[4] RISC-V International. RISC-V "V" Vector Extension Version 1.0.
https://github.com/riscv/riscv-v-spec/releases/tag/v1.0, 2021. [Online; accessed 6-Aug-2024].
