RISC-V Vector Extension Guide
Outline
• RISC-V
• The Vector Extension
• Compilation
• How to use the Vector Extension
RISC-V
RISC-V Architecture
• RISC-V is an openly licensed architecture
• In contrast to many other architectures, no licence is required to implement it
• Created at the University of California, Berkeley in 2010
• RISC-V International fosters the RISC-V Architecture
• Provides several resources to its members (e.g., educational, compliance, …)
• Unprivileged architecture specification
• Of interest to any application
• Privileged architecture specification
• Typically only of interest to the supervisor (OS), hypervisor and/or firmware
RISC-V ISA Design
• The RISC-V base ISA is very minimal and simple
• Load/Store-based architecture, one addressing mode
• Around 50 instructions, only basic integer arithmetic
• No CPU flags, very similar to MIPS
• Realistically useful only for very simple CPUs or microcontrollers
• RISC-V instructions are fixed-size 32-bit
• The encoding allows for 16-bit, 48-bit and 64-bit (and even larger) formats
RISC-V Base ISA
• 32 integer registers of XLEN bits each (general-purpose registers, GPRs)
• x0 to x31
• x0 is constant and hardcoded to all zeros, read-only
• RV32 defines XLEN=32
• RV64 defines XLEN=64
• RISC-V 64-bit architecture does not provide 32-bit integer registers
• No further state defined
RISC-V Extensions
• RISC-V is augmented via the concept of extensions
• Extensions can add new instructions and CPU state
• The base ISA is called I (for Integer)
• RV32I
• RV64I (XLEN=64; adds a few arithmetic instructions to improve 32-bit integer arithmetic)
• Common standard extensions in a RISC-V 64-bit Linux-capable core:
• M. Integer multiplication and division (mul, div, rem, …)
• A. Atomic memory operations
• F. Single-precision floating point (IEEE 754 binary32)
• D. Double-precision floating point (IEEE 754 binary64)
• IMAFD = G
Other extensions of interest
• Zb*. Bit manipulation (adds bitwise operations missing in the base ISA that improve code generation)
• Zfh. Half-precision floating point (IEEE 754 binary16)
• Zfinx. "F in X" (F and D but without a dedicated floating-point register bank)
• P. Packed SIMD (small-element integer SIMD inside the GPR registers)
• Zv*. Vectors (vector computation)
• The main topic of this course ☺
The Vector Extension
Vector Extension Design
• The RISC-V Vector Extension (RVV) aims at providing vector computation
capabilities to the RISC-V architecture
• RVV aims for wide applicability, so its design is very flexible
• The flexible design poses some challenges when using the ISA, for
• Compilers
• Developers
• The flexibility also allows RVV to be used in more classical SIMD approaches
• provided some assumptions hold
RVV Design
• Vector ISAs are typically large, for several reasons
• They provide vector equivalents of most scalar arithmetic instructions
• They have specific instructions of their own for memory access and vector element manipulation
• Modern vector ISAs have some form of predication (masking) support
• operations to form predicates (e.g., comparisons)
• additional mask operands in the instructions
• This impacts the design of RVV
• Vector operations cannot be fully encoded in a 32-bit instruction
• CPU state is used instead
RVV Parameters and Basic State
• RVV defines 32 vector registers of VLEN bits each
• v0 to v31
• VLEN is a constant parameter chosen by the implementor and must be a power of two
• The Zv* standard extensions constrain VLEN to be at least 64 or 128
• E.g., VLEN=512 would be equivalent in size to Intel AVX-512
• VLEN is not a great name, so read it as "vector register size (in bits)"
• Vectors in RVV are divided into elements
• The size of the elements ranges from 8 bits up to ELEN bits
• ELEN is a constant parameter chosen by the implementor
• It must be a power of two with 8 ≤ ELEN ≤ VLEN
• The Zv* standard extensions constrain ELEN to be at least 32 or 64
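Since VLEN is fixed per implementation, a program can discover it at run time. A minimal sketch in C, assuming a toolchain that accepts the vlenb CSR name (CSR 0xC22, introduced later in these slides), which holds VLEN in bytes:

#include <stdint.h>

// Sketch: read the read-only vlenb CSR (VLEN in bytes), so VLEN = 8 * vlenb.
static inline uint64_t read_vlenb(void) {
    uint64_t vlenb;
    __asm__("csrr %0, vlenb" : "=r"(vlenb));
    return vlenb;
}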
Example (1)
Convention in these slides
Note: most vector ISA specifications represent elements in the vector from higher-numbered to lower-numbered positions, but still map the lower-numbered elements to lower addresses.
RVV Operational State
• There are two registers used when operating on vectors in RVV
• vtype: Vector Type
• vl: Vector Length (not to be confused with VLEN!)
• vtype describes the type of vector we are going to operate on and includes
• sew: Selected Element Width. The size in bits of the elements being operated on
• 8 ≤ sew ≤ ELEN
• lmul: Length Multiplier. Allows grouping registers (more on this later!)
• lmul = 2^k where −3 ≤ k ≤ 3 (i.e., lmul ∈ {1/8, 1/4, 1/2, 1, 2, 4, 8})
• vl describes how many elements of the vector (starting from element zero) we are going to operate on
• 0 ≤ vl ≤ vlmax(sew, lmul)
• vlmax(sew, lmul) = (VLEN / sew) × lmul
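• Worked example: with VLEN = 128, vlmax(sew=32, lmul=1) = (128/32) × 1 = 4, vlmax(sew=64, lmul=2) = (128/64) × 2 = 4, and vlmax(sew=8, lmul=1/2) = (128/8) × 1/2 = 8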
Example (2)
Mixed element sizes
• Vectors with a smaller element size can fit a larger number of elements
• And the opposite: a vector with elements of ELEN bits fits the smallest number of elements
• When operating on vectors whose elements are of different sizes, we have different numbers of elements
• This causes problems for algorithms, which want to operate on the same number of elements
• We can "harmonise" the number of elements when the element sizes differ by either
• not using the whole vector register for the small element sizes, or
• using more than one vector register for the large element sizes
Length multiplier
• RVV supports both scenarios via the length multiplier
• When lmul = 1 we can operate on up to all the elements of a vector register
• When lmul < 1 (lmul ∈ {1/2, 1/4, 1/8}) we can operate on up to a fraction of the elements of a vector register
• When lmul > 1 the operation uses a vector group of lmul vector registers
• A vector group "gangs" several vector registers together. The vector group is identified by the smallest-numbered vector register in the group.
• 16 vector groups when lmul = 2
• v0, v2, v4, v6, v8, v10, v12, v14, v16, …, v28, v30
• 8 vector groups when lmul = 4
• v0, v4, v8, v12, v16, v20, v24, v28
• 4 vector groups when lmul = 8
• v0, v8, v16, v24
Example (3)
Example (4)
Vector operation
• Vector instructions fully determine the vector operation we are going to execute by using the values of vl and vtype
• vl and vtype act as implicit operands of the vector instructions
• When vl < vlmax, there are elements that are not operated on
• Those elements are called the tail elements
• RVV offers two policies here
• tail undisturbed. Tail elements in the destination register are left unmodified.
• tail agnostic. Can behave like tail undisturbed or, alternatively, all the bits of the tail elements of the destination register are set to 1.
Example (5)
Example (6)
Masking (Predication)
• Control flow may be problematic when using vector instructions. Turning it into data flow (e.g., if-conversion) allows us to represent the control flow as a value that can be held in a vector.
• A mask vector is a vector whose elements are single bits
• There are no distinguished vector registers for mask vectors (v0 to v31 can be used)
• RVV defines a specific layout for mask vectors in which the bits are packed contiguously in the vector register, starting from the least-significant bit as element 0 of the mask.
• Instructions can be masked using the v0 register
• While it is possible to compute mask vectors into all the other registers, only v0 can be used as the mask operand when masking a vector instruction
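As an illustration, a hedged sketch using the RVV C built-ins (covered later in these slides; the intrinsic names follow the rvv-intrinsic-doc and may differ between compiler versions): where a[i] < 0, overwrite c[i] with -a[i], leaving the other elements of c untouched thanks to a masked store.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

void negate_negatives(int32_t *c, const int32_t *a, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
        vbool32_t m = __riscv_vmslt_vx_i32m1_b32(va, 0, vl); // mask: a[i] < 0
        vint32m1_t vneg = __riscv_vneg_v_i32m1(va, vl);      // compute -a[i]
        __riscv_vse32_v_i32m1_m(m, c + i, vneg, vl);         // store active elements only
        i += vl;
    }
}

When compiled, the mask ends up in v0, since only v0 can govern a masked instruction.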
Example (7)
Masked Vector operation
• When executing a vector operation, the v0 register (interpreted using the mask vector layout) determines whether a non-tail element is active or inactive.
• Active elements are operated on as usual
• Inactive elements are not operated on at all
• Inactive elements have a policy as well
• mask undisturbed. The corresponding element of the destination register is left unmodified.
• mask agnostic. The corresponding element of the destination register is either left unmodified or has all its bits set to 1.
Setting vl and vtype
• Generic RISC-V instructions cannot set vl or vtype
• There are two cases where vl and/or vtype can change
• the vsetvl / vsetvli / vsetivli instructions ("set vector length")
• the vle*ff instructions (fault-only-first loads)
• The set-vector-length instructions set both vl and vtype
• The most common one is vsetvli
• vsetvli rd, rs, eN,mX,tP,mP (updates rd with the computed vector length)
• rs is an input register operand that contains the application vector length (AVL), which represents the vector length the program wants to use
• vsetivli replaces this operand with a small immediate from 0 to 31
• N in eN is the sew (8, 16, 32, 64, …)
• X in mX is the lmul (spelled fY for the 1/Y cases)
• P is the policy for the tail (t…) and the mask (m…): u for undisturbed, a for agnostic
Example (8)
vsetivli x10, 3, e32,m1,ta,ma # vl ← 3, sew ← 32, lmul ← 1
Special cases setting the vl
• vsetvli rd, x0, eN,mX,tP,mP # rd != x0
• Sets vl to vlmax(sew=N, lmul=X) and vtype to sew=N, lmul=X
• Note: if only VLEN is needed, a dedicated read-only register vlenb exists that returns VLEN in bytes (i.e., vlenb = VLEN/8)
• vsetvli x0, x0, eN,mX,tP,mP
• Only changes vtype (the application vector length is assumed to be the current vl)
• Only valid when the new vlmax is left unchanged
vsetivli x0, 10, e32,m1,ta,ma # vl ← min(10, vlmax); vlmax = (VLEN/32) × 1 = VLEN/32
What if AVL > vlmax(sew,lmul)?
• The specification allows computing the new vl as vl = min(vlmax(sew, lmul), AVL)
• However, when vlmax(sew, lmul) < AVL < 2 × vlmax(sew, lmul), an implementation may compute any vl with ceil(AVL / 2) ≤ vl ≤ vlmax(sew, lmul)
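• Worked example: with VLEN = 128, sew = 32 and lmul = 1 we have vlmax = 4; for AVL = 6, an implementation may return any vl with ceil(6/2) = 3 ≤ vl ≤ 4, so both a 4-then-2 and an even 3-then-3 split of the work are legal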
Many more details we cannot cover
• There are many more details of the V extension's ISA that we cannot cover here
• If I have piqued your interest, I recommend you look at the full specification here:
• https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf
Comparison to other Vector/SIMD ISAs
• Is the vector register size defined by the architecture?
• Intel AVX-512: Yes. 512 bits (the VL extension allows using the 128-bit (SSE) and 256-bit (AVX-2) registers).
• Arm SVE: No. From 128 bits to 2048 bits (in multiples of 128 bits).
• NEC SX-Aurora TSUBASA: Yes. The current generation is 16,384 bits.
• RISC-V Vector: No. Powers of two, from 64/128 up to 65,536 bits.
• Predication/Masking?
• Intel AVX-512: Yes. 8 mask registers k0-k7 (k0 hardcoded to all ones).
• Arm SVE: Yes. 16 vector predicate registers p0-p15 (p0-p7 for masking, p8-p15 for loops).
• NEC SX-Aurora TSUBASA: Yes. 16 vector mask registers.
• RISC-V Vector: Yes. Only v0, as an implicit operand if the instruction is masked.
This table is by no means meant to be exhaustive; there are other important differences, like the set of data types supported by the ISA (e.g., fixed floating types, complex, polynomials, saturated arithmetic, etc.).
Compilation
(What we did in LLVM)
Super quick summary of how compilers work
[Figure: compiler pipeline from source code through an intermediate representation to the back end]
Back end
[Figure: stages of the compiler back end]
Modern compilation
• Modern compiler infrastructures build on top of the concept of virtual registers
• The compiler assumes it has an infinite amount of virtual registers
• Virtual registers are then mapped onto physical (architectural) registers of a specific kind in a process called register allocation
• This process assumes that a temporary area exists to keep values in case there are not enough physical registers of a given kind
• This area is commonly memory, in a process called "register spilling", but it could be other (kinds of) registers too (at the risk of spilling those too!)
Modern compilation and CPU state
• CPU state in the form of specific registers does not fit the virtual-register compilation model well
• Examples of such registers: vl, vtype, the FP status register, etc.
• There is just one of each such register, so there is nothing for register allocation to assign
• They often pack many details
• The FP status register might contain bits for the rounding mode and bits for the result of the last FP instruction
• FP arithmetic instructions commonly use the former bits and define the latter bits
• Often there is an expectation that those registers are implicit
• An FP routine can use the same instructions but a different rounding mode by setting the mode beforehand
• All these details impact how the backend of the compiler represents the instructions
What we did for RVV in LLVM
• In order to have a sensible representation of the instructions in the RISC-V backend, we chose to do the following:
• vl and sew are explicit operands of the instructions
• sew is always an immediate
• lmul is represented in the instruction opcode used by the compiler
• each (ISA) instruction has 7 associated opcodes in the compiler
• (the reason is that we defined 4 vector register classes, one for each lmul ≥ 1, but LLVM does not allow overloading opcodes over different register classes)
• Instruction selection must select the right instruction based on the lmul and set the right sew (both depend on the vector types used by the IR) and provide a vector length.
Where does the vector length come from?
• If we are operating on whole vectors, the vector length is basically vlmax(sew, lmul)
• This happens because LLVM IR has an elementwise extension from scalar-type operations to vector-type operations for almost every arithmetic instruction
• If not, the vector length is somehow provided by the user
• RVV IR intrinsics, which map almost verbatim to the RVV C/C++ builtins
• Vector Predication IR, which includes a vector length operand
What does this give us?
• Right after instruction selection, instructions in the low-level IR of the
compiler look like this
[Figure: a compiler pseudo-instruction annotated with its lmul, sew, and vector-length operands]
• This is not valid RVV code but represents the intent correctly
Code generation
• Now that we have associated each instruction with the sew and vector
length they need, it is time to make sure both vtype and vl are correctly set
in the CPU state.
• A pass analyses the instructions and inserts the needed vsetvli instructions
[Figure: example of an inserted vsetvli with e64,m1,ta,mu]
How to program with the Vector Extension
Ways to use RVV
• There are many different ways to use the RVV instructions, ordered here from more control (less productivity) to more productivity (less control):
• Assembly
• C/C++ builtins
• Automatic (or semi-automatic) vectorization
• Libraries and/or kernels
RVV C/C++ Built-ins
• Covers all the RVV instructions (~40,000 built-ins)
• Same philosophy as EPI with respect to the explicit vector length
• Full specification at
• https://github.com/riscv-non-isa/rvv-intrinsic-doc
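As a taste, a minimal sketch of a vector add with an explicit vector length (following the rvv-intrinsic-doc naming, where recent versions prefix the built-ins with __riscv_; older compilers may spell them without the prefix):

#include <riscv_vector.h>
#include <stddef.h>

void vec_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e64m1(n - i);              // elements this iteration
        vfloat64m1_t va = __riscv_vle64_v_f64m1(a + i, vl);   // load a[i..i+vl)
        vfloat64m1_t vb = __riscv_vle64_v_f64m1(b + i, vl);   // load b[i..i+vl)
        vfloat64m1_t vc = __riscv_vfadd_vv_f64m1(va, vb, vl); // elementwise add
        __riscv_vse64_v_f64m1(c + i, vc, vl);                 // store c[i..i+vl)
        i += vl;
    }
}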
Beyond built-ins
• Built-ins may be unavoidable in some situations, but they require taking care of low-level concerns
• Libraries will typically use them in their optimised implementations
• Typical alternatives here involve some amount of vectorization
• SLP vectorization
• Loop vectorization: #pragma omp simd
• Whole-function vectorization: OpenCL, SYCL, #pragma omp declare simd
Loop vectorization
• Loops may (intuitively) be vectorized if
• they do not have loop-carried dependences (i.e., a parallel loop)
• or they do, but the "source" and the "sink" of every loop-carried dependence are many iterations apart in the iteration space
• Note that there are some cases where this definition does not apply yet the loop is still vectorizable, e.g., reductions.
Vector-length specific / agnostic
• Traditionally, vectorization targets a specific vector register size known to the compiler
• This is the natural approach for vector ISAs that prescribe the size of the vector register, such as AVX-512
• This has been called "vector-length specific" (VLS) vectorization
• Arm SVE introduced "vector-length agnostic" (VLA) vectorization
• Useful for architectures where the implementation determines the size of the vector register, as it avoids having many versions of the code for the different vector sizes
• RISC-V (and SVE) can use either approach
• The compiler must be told the minimum vector register size it can assume
• For RISC-V, at BSC we have focused on VLA
• A JIT can be used as a hybrid scheme that does "adaptive" VLS at run time
Current Vectorization Schemes in LLVM
• In a vector loop we must consider what to do when the number of remaining iterations is lower than the number of elements that fit in the vector register.
• Classical scheme implemented in LLVM: a first loop that operates on full vectors, followed by a scalar loop (the epilog) that processes the remainder elements.
• Cons: two loops; long vectors increase the risk that the program only executes the epilog (without entering the vector loop). The epilog can be vectorized again with shorter vectors.
• "Tail folding": only one vector loop, but first compute a mask that disables the elements that would be past the loop boundary, and use the mask in all the operations that need it.
• Pro: one loop
• Cons: needs to compute a mask, even though the mask is only really needed for the loads/stores
Loop vectorization with RVV
• As part of the EPI project, we extended the LLVM loop vectorizer
• It uses the Vector Predication IR proposed by Simon Moll
• Operations in this IR have an explicit vector length and mask
• https://llvm.org/docs/Proposals/VectorPredication.html
• Vector Predication IR maps well to RISC-V
• But it is also applicable to other vector ISAs such as AVX-512 or SVE
Vector length-based vectorization
• Following the style of "tail folding", we can compute the vector length from the remaining number of iterations
• Just as tail folding passes the mask to all the vector operations, we pass the vector length of the current vector-loop iteration to all the vector operations (sketched below)
• Pros: one loop
• Cons: the vector length must be computed in each iteration
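Conceptually, the generated loop has this shape (a scalar sketch, not compiler output; the min computation is what vsetvli performs, ignoring the ceil(AVL/2) relaxation shown earlier):

#include <stddef.h>

void vl_based_loop_shape(size_t n, size_t vlmax) {
    for (size_t i = 0; i < n;) {
        size_t vl = (n - i) < vlmax ? (n - i) : vlmax; // this iteration's vector length
        // ... every vector operation in this iteration processes exactly vl elements ...
        i += vl;
    }
}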
Example: LLVM IR
void add_ref(int N, double *c, double *a, double *b) {
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}
[Figure: the resulting LLVM IR is not reproduced here]
Example: assembly
void add_ref(int N, double *c, double *a, double *b) {
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}
.LBB0_4: # %vector.body
  slli a7, a4, 3
  add a6, a2, a7
  sub a5, a0, a4
  vsetvli t0, a5, e64, m1, ta, mu
  vle64.v v8, (a6)
  add a5, a3, a7
  vle64.v v9, (a5)
  vfadd.vv v8, v8, v9
  add a5, a1, a7
  add a4, a4, t0
  vse64.v v8, (a5)
  bne a4, a0, .LBB0_4
Strided Accesses
• Sometimes loops have to do "strided accesses"
• For instance, loops that operate on arrays of complex numbers
• However, LLVM does not have the notion of a strided memory access
• Complex memory accesses are handled as scatter/gather
• This works for VLS vectorization (by analysing the vector indices), but it is harder to do under VLA
• We have contributed experimental Vector Predication support to LLVM
• The loop vectorizer can identify such accesses and use strided memory accesses
Example
void zaxpy(_Complex float a,
           _Complex float * __restrict dx,
           _Complex float * __restrict dy,
           int n) {
  for (int i = 0; i < n; i++) {
    dy[i] += a * dx[i];
  }
}

_Complex types are pairs of the real and imaginary parts. This code has been compiled using -Ofast so the complex multiplication does not check for NaN or infinite values (that check involves a runtime call that prevents vectorization).

Excerpt of the vector body using gather (vluxe) and scatter (vsoxe):
…
vluxei64.v v11, (zero), v11
vluxei64.v v10, (zero), v10
vfmul.vf v12, v11, fa0
vfmacc.vf v12, fa1, v10
vfmul.vf v11, v11, fa1
vfmsac.vf v11, fa0, v10
vsetvli a5, zero, e64, m1, ta, mu
vadd.vx v9, v9, a1
vsetvli zero, a4, e64, m1, ta, mu
vluxei64.v v10, (zero), v9
vsetvli a5, zero, e64, m1, ta, mu
vadd.vi v13, v9, 8
vsetvli zero, a4, e64, m1, ta, mu
vluxei64.v v14, (zero), v13
vfadd.vv v10, v11, v10
vfadd.vv v11, v12, v14
vsoxei64.v v10, (zero), v9
vsoxei64.v v11, (zero), v13
…

The whole loop using strided load (vlse) and strided store (vsse):
.LBB0_5:
sub a5, a2, a3
vsetvli t0, a5, e64, m1, ta, mu
slli a5, a3, 4
add a4, a6, a5
vlse64.v v8, (a4), t1
add a4, a0, a5
vlse64.v v9, (a4), t1
vfmul.vf v10, v8, fa0
vfmacc.vf v10, fa1, v9
add a4, a1, a5
vlse64.v v11, (a4), t1
add a5, a5, a7
vlse64.v v12, (a5), t1
vfmul.vf v8, v8, fa1
vfmsac.vf v8, fa0, v9
vfadd.vv v8, v8, v11
vfadd.vv v9, v10, v12
vsse64.v v8, (a4), t1
add a3, a3, t0
vsse64.v v9, (a5), t1
bne a3, a2, .LBB0_5
Summary of vectorization schemes
[Figure: summary of the vectorization schemes discussed above]
Vectorization may fail
• Compilers must be conservative in the analyses they perform to avoid breaking the program semantics.
• Sometimes the compiler will not vectorize a loop.
• clang supports the following flags that can help identify the reasons (a sample invocation follows this list)
• -Rpass=loop-vectorize
• Reports successfully vectorized loops.
• -Rpass-missed=loop-vectorize
• Reports loops that were not vectorized.
• -Rpass-analysis=loop-vectorize
• Reports extra details about why a loop failed to vectorize.
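A plausible invocation combining the three flags (the -Rpass* flags are standard clang flags; the target and -march string are assumptions for a vector-enabled RISC-V build):

clang -O2 --target=riscv64 -march=rv64gcv \
      -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
      -Rpass-analysis=loop-vectorize -c test.c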
Vectorization report
1  void works(long N, double *c) {
2    long i;
3    for (i = 0; i < N-1; i++) {
4      c[i] = c[i+4]*0.5;
5    }
6  }
7
8  void fails(long N, double *c) {
9    long i;
10   for (i = 4; i < N; i++) {
11     c[i] = c[i-4]*0.5;
12   }
13 }
14

test.c:3:3: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
  for (i = 0; i < N-1; i++) {
  ^
test.c:3:3: remark: vectorized loop (vectorization width: vscale x 1, interleaved count: 1) [-Rpass=loop-vectorize]
test.c:10:3: remark: loop not vectorized: Scalable vectorization does not support vectorizing loops that are not parallel yet [-Rpass-analysis=loop-vectorize]
VLS Loop Vectorization
• Vector-length specific (VLS) vectorization can be used on RISC-V as well
• One must tell the compiler the minimum size of the vector register it can assume
• -mllvm -riscv-v-vector-bits-min=<vlen>
• (At some point -mcpu=<cpu-name> will internally do that as well)
• Con: the code will only work on CPUs with an equal or larger VLEN
• We could always use the scalar epilogue as a fallback if need be
• Pro: vl is only set outside the loop (though vtype might still have to change)
• Pro: the compiler can generate better code because it is fully aware of the size of the vector register.
Example: with VLEN ≥ 256 we can fit 4 doubles in each register

void daxpy(double a,
           double * __restrict dx,
           double * __restrict dy,
           int n) {
  int i;
  for (i = 0; i < n; i++) {
    dy[i] += a * dx[i];
  }
}

Compiled with -mllvm -riscv-v-vector-bits-min=256:

.LBB0_3:
  andi a6, a2, -8
  vsetivli zero, 4, e64, m1, ta, mu
  vfmv.v.f v8, fa0
  mv a4, a6
  mv a5, a1
  mv a3, a0
.LBB0_4:
  addi a7, a5, 32
  addi t0, a3, 32
  vle64.v v9, (a3)
  vle64.v v10, (t0)
  vle64.v v11, (a5)
  vle64.v v12, (a7)
  vfmacc.vv v11, v8, v9
  vfmacc.vv v12, v8, v10
  vse64.v v11, (a5)
  vse64.v v12, (a7)
  addi a3, a3, 64
  addi a4, a4, -8
  addi a5, a5, 64
  bnez a4, .LBB0_4
  beq a6, a2, .LBB0_8

The compiler has chosen to do "interleaving", in which it effectively uses several (in this case 2) vector operations per scalar operation. LMUL=2 could be an alternative here if we care about code size.
SLP Vectorization
• SLP (Superword Level Parallelism) is an alternative vectorization approach that identifies repeated scalar operations and coalesces them using vector instructions
• SLP is most practical when we know the size of the vector register
• A scalable version is possible: check that VLEN is large enough and branch to the original scalar code otherwise, though this has a code-size impact
SLP Vectorization: Example
void saxpy8(float *__restrict A,
            float *__restrict B,
            float C) {
  A[0] += C*B[0];
  A[1] += C*B[1];
  A[2] += C*B[2];
  A[3] += C*B[3];
  A[4] += C*B[4];
  A[5] += C*B[5];
  A[6] += C*B[6];
  A[7] += C*B[7];
}

Compiled with -mllvm -riscv-v-vector-bits-min=128:

saxpy8:
  vsetivli zero, 4, e32, m1, ta, mu
  vle32.v v8, (a1)
  vle32.v v9, (a0)
  vfmacc.vf v9, fa0, v8
  vse32.v v9, (a0)
  addi a1, a1, 16
  addi a0, a0, 16
  vle32.v v8, (a1)
  vle32.v v9, (a0)
  vfmacc.vf v9, fa0, v8
  vse32.v v9, (a0)
  ret
OpenMP SIMD
• As of version 4.0, OpenMP has constructs that assist with vectorization
• #pragma omp simd
• Loop vectorization
• #pragma omp declare simd
• For functions called in "#pragma omp simd" loops
• #pragma omp simd reuses the existing loop-vectorization infrastructure
• May relax a few legality checks
• We have been implementing support for functions under #pragma omp declare simd where the function receives a vector length
• This way they can be used in loops vectorized using the vector length
Example (BSC suggestion)
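A plausible scalar source for the assembly below, reconstructed by hand and therefore hypothetical: it assumes the constant .LCPI1_0 holds 1.0, so that the body matches the vfrdiv/vfadd/vfrdiv sequence.

#pragma omp declare simd
double example(double x, double y) {
    // 1/x + 1/y, then the reciprocal of the sum
    return 1.0 / (1.0 / x + 1.0 / y);
}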
_ZGVENk2vv_example:
lui a1, %hi(.LCPI1_0)
fld ft0, %lo(.LCPI1_0)(a1)
vsetvli zero, a0, e64, m2, ta, mu
vfrdiv.vf v8, v8, ft0
vfrdiv.vf v10, v10, ft0
vfadd.vv v8, v8, v10
vfrdiv.vf v8, v8, ft0
ret
Try our compiler!
• You can toy with our compiler for RVV in our Compiler Explorer instance
• https://repo.hca.bsc.es/epic
• Make sure you enable at least -O2 to enable vectorization
• It defaults to vector-length-based vectorization
• Click the execution button to run using qemu with VLEN=512 bits
• Click the examples button for examples
• To vectorize functions with #pragma omp declare simd, make sure you pass -fopenmp-simd -Xclang -vectorize-wfv
• The linear and uniform clauses are not implemented yet
• Functions cannot contain loops
• This is currently a limitation of LLVM's Loop Vectorizer
Software Development Vehicles
• How do we create a quick feedback loop for hardware-software co-design?
• Software Development Vehicles with progressive performance fidelity
Vehave emulator
• Trap-based emulator (slow!)
• Allows us to check the correctness of the compiler and of the porting of applications to RVV
• qemu can be used as an alternative for this
• Generates traces that can be used for performance modelling and compiler code-generation analysis using tools like Paraver and MUSA
• Runs on any RISC-V 64-bit Linux platform that does not have RVV support (the vector instructions trap and are emulated)
Wrap-up
Conclusions
• The RISC-V Vector Extension is a powerful and flexible vector ISA
• The vector length is a convenient way to vectorize loops and functions
• It is possible to generate efficient RVV code using LLVM
• Comprehensive set of C/C++ built-ins
• The LLVM loop vectorizer can be used to vectorize for RVV
• Now working on making OpenMP SIMD usable too
Thank you!
The European Processor Initiative (EPI) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647 and under EPI-SGA2: 101036168. Please see http://www.european-processor-initiative.eu for more information.
The European PILOT project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 101034126. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Spain, Italy, Switzerland, Germany, France, Greece, Sweden, Croatia and Turkey.
The MEEP project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 946002. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Spain, Croatia and Turkey.