RISC-V Vector Extension Guide
Outline
• RISC-V
• The Vector Extension
• Compilation
• How to use the Vector Extension
RISC-V
RISC-V Architecture
• RISC-V is an openly licensed architecture
• In contrast to many other architectures, no licence is required to implement it
• Created at the University of California, Berkeley in 2010
• RISC-V International fosters the RISC-V Architecture
• Provides several resources to its members (e.g., educational, compliance, …)
• Unprivileged architecture specification
• Of interest to any application
• Privileged architecture specification
• Typically only of interest to the supervisor (OS), hypervisor and/or firmware
RISC-V ISA Design
• The RISC-V base ISA is very minimal and simple
• Load/Store-based architecture, one addressing mode
• Around 50 instructions, only basic integer arithmetic
• No CPU flags, very similar to MIPS
• Realistically useful only for very simple CPUs or microcontrollers
• RISC-V instructions are fixed-size 32-bit
• The encoding allows for 16-bit, 48-bit and 64-bit (and even larger) formats
RISC-V Base ISA
• 32 integer registers of XLEN bits each (general-purpose registers, GPRs)
• x0 to x31
• x0 is constant and hardcoded to all zeros, read-only
• RV32 defines XLEN=32
• RV64 defines XLEN=64
• RISC-V 64-bit architecture does not provide 32-bit integer registers
• No further state defined
RISC-V Extensions
• RISC-V is augmented via the concept of extensions
• Extensions can add new instructions and CPU state
• The base ISA is called I (for Integer)
• RV32I
• RV64I (XLEN=64; adds a few arithmetic instructions to improve 32-bit integer arithmetic)
• Common standard extensions in a RISC-V 64-bit Linux-capable core:
• M. Integer multiplication and division (mul, div, rem, …)
• A. Atomic memory operations
• F. Single-precision floating point (IEEE 754 binary32)
• D. Double-precision floating point (IEEE 754 binary64)
• IMAFD = G
Other extensions of interest
• Zb*. Bit manipulation (adds bitwise operations missing in the base ISA that improve code generation)
• Zfh. Half-precision floating point (IEEE 754 binary16)
• Zfinx. "F in X" (F and D but without a dedicated floating-point register bank)
• P. Packed SIMD (small-element integer SIMD inside the GPR registers)
• Zv*. Vectors (vector computation)
• The main topic of this course ☺
The Vector Extension
Vector Extension Design
• The RISC-V Vector Extension (RVV) aims at providing vector computation
capabilities to the RISC-V architecture
• RVV aims for wide applicability, so its design is very flexible
• The flexible design poses some challenges when using the ISA, for
• Compilers
• Developers
• The flexibility also allows RVV to be used in more classical SIMD approaches
• provided some assumptions hold
RVV Design
• Vector ISAs are typically large, for several reasons
• They provide vector equivalents of most scalar arithmetic instructions
• They have specific instructions of their own for memory access and vector element manipulation
• Modern vector ISAs have some form of predication (masking) support
• operations to form predicates (e.g., comparisons)
• additional mask operands in the instructions
• This impacts the design of RVV
• Vector operations cannot be fully encoded in a 32-bit instruction
• CPU state is used instead
RVV Parameters and Basic State
• RVV defines 32 vector registers of VLEN bits each
• v0 to v31
• VLEN is a constant parameter chosen by the implementor and must be a power of two
• The Zv* standard extensions constrain VLEN to be at least 64 or 128
• E.g., VLEN=512 would be equivalent in size to Intel AVX-512
• VLEN is not a great name, so read it as "vector register size (in bits)"
• Vectors in RVV are divided into elements
• The size of the elements ranges from 8 bits up to ELEN bits
• ELEN is a constant parameter chosen by the implementor
• It must be a power of two with 8 ≤ ELEN ≤ VLEN
• The Zv* standard extensions constrain ELEN to be at least 32 or 64
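Since VLEN is fixed per implementation, a program can discover it at run time. A minimal sketch in C, assuming a toolchain that accepts the vlenb CSR name (CSR 0xC22, introduced later in these slides), which holds VLEN in bytes:

#include <stdint.h>

// Sketch: read the read-only vlenb CSR (VLEN in bytes), so VLEN = 8 * vlenb.
static inline uint64_t read_vlenb(void) {
    uint64_t vlenb;
    __asm__("csrr %0, vlenb" : "=r"(vlenb));
    return vlenb;
}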
Example (1)
Convention in these slides
Note: most vector ISA specifications represent elements in the vector from higher-numbered to lower-numbered positions, but still map the lower-numbered elements to lower addresses.
RVV Operational State
• There are two registers used when operating on vectors in RVV
• vtype: Vector Type
• vl: Vector Length (not to be confused with VLEN!)
• vtype describes the type of vector we are going to operate on and includes
• sew: Selected Element Width. The size in bits of the elements being operated on
• 8 ≤ sew ≤ ELEN
• lmul: Length Multiplier. Allows grouping registers (more on this later!)
• lmul = 2^k where −3 ≤ k ≤ 3 (i.e., lmul ∈ {1/8, 1/4, 1/2, 1, 2, 4, 8})
• vl describes how many elements of the vector (starting from element zero) we are going to operate on
• 0 ≤ vl ≤ vlmax(sew, lmul)
• vlmax(sew, lmul) = (VLEN / sew) × lmul
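• Worked example: with VLEN = 128, vlmax(sew=32, lmul=1) = (128/32) × 1 = 4, vlmax(sew=64, lmul=2) = (128/64) × 2 = 4, and vlmax(sew=8, lmul=1/2) = (128/8) × 1/2 = 8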
Example (2)
Mixed element sizes
• Vectors with a smaller element size can fit a larger number of elements
• And the opposite: a vector with elements of ELEN bits fits the smallest number of elements
• When operating on vectors whose elements are of different sizes, we have different numbers of elements
• This causes problems for algorithms, which want to operate on the same number of elements
• We can "harmonise" the number of elements when the element sizes differ by either
• not using the whole vector register for the small element sizes, or
• using more than one vector register for the large element sizes
Length multiplier
• RVV supports both scenarios via the length multiplier
• When lmul = 1 we can operate on up to all the elements of a vector register
• When lmul < 1 (lmul ∈ {1/2, 1/4, 1/8}) we can operate on up to a fraction of the elements of a vector register
• When lmul > 1 the operation uses a vector group of lmul vector registers
• A vector group "gangs" several vector registers together. The vector group is identified by the smallest-numbered vector register in the group.
• 16 vector groups when lmul = 2
• v0, v2, v4, v6, v8, v10, v12, v14, v16, …, v28, v30
• 8 vector groups when lmul = 4
• v0, v4, v8, v12, v16, v20, v24, v28
• 4 vector groups when lmul = 8
• v0, v8, v16, v24
Example (3)
Example (4)
Vector operation
• Vector instructions fully determine the vector operation we are going to execute by using the values of vl and vtype
• vl and vtype act as implicit operands of the vector instructions
• When vl < vlmax, there are elements that are not operated on
• Those elements are called the tail elements
• RVV offers two policies here
• tail undisturbed. Tail elements in the destination register are left unmodified.
• tail agnostic. Can behave like tail undisturbed or, alternatively, all the bits of the tail elements of the destination register are set to 1.
Example (5)
Example (6)
Masking (Predication)
• Control flow may be problematic when using vector instructions. Turning it into data flow (e.g., if-conversion) allows us to represent the control flow as a value that can be held in a vector.
• A mask vector is a vector whose elements are single bits
• There are no distinguished vector registers for mask vectors (v0 to v31 can be used)
• RVV defines a specific layout for mask vectors in which the bits are packed contiguously in the vector register, starting from the least-significant bit as element 0 of the mask.
• Instructions can be masked using the v0 register
• While it is possible to compute mask vectors into all the other registers, only v0 can be used as the mask operand when masking a vector instruction
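As an illustration, a hedged sketch using the RVV C built-ins (covered later in these slides; the intrinsic names follow the rvv-intrinsic-doc and may differ between compiler versions): where a[i] < 0, overwrite c[i] with -a[i], leaving the other elements of c untouched thanks to a masked store.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

void negate_negatives(int32_t *c, const int32_t *a, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
        vbool32_t m = __riscv_vmslt_vx_i32m1_b32(va, 0, vl); // mask: a[i] < 0
        vint32m1_t vneg = __riscv_vneg_v_i32m1(va, vl);      // compute -a[i]
        __riscv_vse32_v_i32m1_m(m, c + i, vneg, vl);         // store active elements only
        i += vl;
    }
}

When compiled, the mask ends up in v0, since only v0 can govern a masked instruction.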
Example (7)
Masked Vector operation
• When executing a vector operation, the v0 register (interpreted using the mask vector layout) determines whether a non-tail element is active or inactive.
• Active elements are operated on as usual
• Inactive elements are not operated on at all
• Inactive elements have a policy as well
• mask undisturbed. The corresponding element of the destination register is left unmodified.
• mask agnostic. The corresponding element of the destination register is either left unmodified or has all its bits set to 1.
Setting vl and vtype
• Generic RISC-V instructions cannot set vl or vtype
• There are two cases where vl and/or vtype can change
• the vsetvl / vsetvli / vsetivli instructions ("set vector length")
• the vle*ff instructions (fault-only-first loads)
• The set-vector-length instructions set both vl and vtype
• The most common one is vsetvli
• vsetvli rd, rs, eN,mX,tP,mP (updates rd with the computed vector length)
• rs is an input register operand that contains the application vector length (AVL), which represents the vector length the program wants to use
• vsetivli replaces this operand with a small immediate from 0 to 31
• N in eN is the sew (8, 16, 32, 64, …)
• X in mX is the lmul (spelled fY for the 1/Y cases)
• P is the policy for the tail (t…) and the mask (m…): u for undisturbed, a for agnostic
Example (8)
vsetivli x10, 3, e32,m1,ta,ma # vl ← 3, sew ← 32, lmul ← 1
Special cases setting the vl
• vsetvli rd, x0, eN,mX,tP,mP # rd != x0
• Sets vl to vlmax(sew=N, lmul=X) and vtype to sew=N, lmul=X
• Note: if only VLEN is needed, a dedicated read-only register vlenb exists that returns VLEN in bytes (i.e., vlenb = VLEN/8)
• vsetvli x0, x0, eN,mX,tP,mP
• Only changes vtype (the application vector length is assumed to be the current vl)
• Only valid when the new vlmax is left unchanged
vsetivli x0, 10, e32,m1,ta,ma # vl ← min(10, vlmax); vlmax = (VLEN/32) × 1 = VLEN/32
What if AVL > vlmax(sew,lmul)?
• The specification allows computing the new vl as vl = min(vlmax(sew, lmul), AVL)
• However, when vlmax(sew, lmul) < AVL < 2 × vlmax(sew, lmul), an implementation may compute any vl with ceil(AVL / 2) ≤ vl ≤ vlmax(sew, lmul)
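• Worked example: with VLEN = 128, sew = 32 and lmul = 1 we have vlmax = 4; for AVL = 6, an implementation may return any vl with ceil(6/2) = 3 ≤ vl ≤ 4, so both a 4-then-2 and an even 3-then-3 split of the work are legal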
Many more details we cannot cover
• There are many more details of the V extension's ISA that we cannot cover here
• If I have piqued your interest, I recommend you look at the full specification here:
• https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf
Comparison to other Vector/SIMD ISAs
• Is the vector register size defined by the architecture?
• Intel AVX-512: Yes. 512 bits (the VL extension allows using the 128-bit (SSE) and 256-bit (AVX-2) registers).
• Arm SVE: No. From 128 bits to 2048 bits (in multiples of 128 bits).
• NEC SX-Aurora TSUBASA: Yes. The current generation is 16,384 bits.
• RISC-V Vector: No. Powers of two, from 64/128 up to 65,536 bits.
• Predication/Masking?
• Intel AVX-512: Yes. 8 mask registers k0-k7 (k0 hardcoded to all ones).
• Arm SVE: Yes. 16 vector predicate registers p0-p15 (p0-p7 for masking, p8-p15 for loops).
• NEC SX-Aurora TSUBASA: Yes. 16 vector mask registers.
• RISC-V Vector: Yes. Only v0, as an implicit operand if the instruction is masked.
This table is by no means meant to be exhaustive; there are other important differences, like the set of data types supported by the ISA (e.g., fixed floating types, complex, polynomials, saturated arithmetic, etc.).
Compilation
(What we did in LLVM)
Super quick summary of how compilers work
[Figure: compiler pipeline from source code through an intermediate representation to the back end]
Back end
[Figure: stages of the compiler back end]
Modern compilation
• Modern compiler infrastructures build on top of the concept of virtual registers
• The compiler assumes it has an infinite amount of virtual registers
• Virtual registers are then mapped onto physical (architectural) registers of a specific kind in a process called register allocation
• This process assumes that a temporary area exists to keep values in case there are not enough physical registers of a given kind
• This area is commonly memory, in a process called "register spilling", but it could be other (kinds of) registers too (at the risk of spilling those too!)
Modern compilation and CPU state
• CPU state in the form of specific registers does not fit the virtual-register compilation model well
• Examples of such registers: vl, vtype, the FP status register, etc.
• There is just one of each such register, so there is nothing for register allocation to assign
• They often pack many details
• The FP status register might contain bits for the rounding mode and bits for the result of the last FP instruction
• FP arithmetic instructions commonly use the former bits and define the latter bits
• Often there is an expectation that those registers are implicit
• An FP routine can use the same instructions but a different rounding mode by setting the mode beforehand
• All these details impact how the backend of the compiler represents the instructions
What we did for RVV in LLVM
• In order to have a sensible representation of the instructions in the RISC-V backend, we chose to do the following:
• vl and sew are explicit operands of the instructions
• sew is always an immediate
• lmul is represented in the instruction opcode used by the compiler
• each (ISA) instruction has 7 associated opcodes in the compiler
• (the reason is that we defined 4 vector register classes, one for each lmul ≥ 1, but LLVM does not allow overloading opcodes over different register classes)
• Instruction selection must select the right instruction based on the lmul and set the right sew (both depend on the vector types used by the IR) and provide a vector length.
Where does the vector length come from?
• If we are operating on whole vectors, the vector length is basically vlmax(sew, lmul)
• This happens because LLVM IR has an elementwise extension from scalar-type operations to vector-type operations for almost every arithmetic instruction
• If not, the vector length is somehow provided by the user
• RVV IR intrinsics, which map almost verbatim to the RVV C/C++ builtins
• Vector Predication IR, which includes a vector length operand
What does this give us?
• Right after instruction selection, instructions in the low-level IR of the
compiler look like this
[Figure: a compiler pseudo-instruction annotated with its lmul, sew, and vector-length operands]
• This is not valid RVV code but represents the intent correctly
Code generation
• Now that we have associated each instruction with the sew and vector
length they need, it is time to make sure both vtype and vl are correctly set
in the CPU state.
• A pass analyses the instructions and inserts the needed vsetvli instructions
[Figure: example of an inserted vsetvli with e64,m1,ta,mu]
How to program with the Vector Extension
Ways to use RVV
• There are many different ways to use the RVV instructions, ordered here from more control (less productivity) to more productivity (less control):
• Assembly
• C/C++ builtins
• Automatic (or semi-automatic) vectorization
• Libraries and/or kernels
RVV C/C++ Built-ins
• Covers all the RVV instructions (~40,000 built-ins)
• Same philosophy as EPI with respect to the explicit vector length
• Full specification at
• https://github.com/riscv-non-isa/rvv-intrinsic-doc
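As a taste, a minimal sketch of a vector add with an explicit vector length (following the rvv-intrinsic-doc naming, where recent versions prefix the built-ins with __riscv_; older compilers may spell them without the prefix):

#include <riscv_vector.h>
#include <stddef.h>

void vec_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e64m1(n - i);              // elements this iteration
        vfloat64m1_t va = __riscv_vle64_v_f64m1(a + i, vl);   // load a[i..i+vl)
        vfloat64m1_t vb = __riscv_vle64_v_f64m1(b + i, vl);   // load b[i..i+vl)
        vfloat64m1_t vc = __riscv_vfadd_vv_f64m1(va, vb, vl); // elementwise add
        __riscv_vse64_v_f64m1(c + i, vc, vl);                 // store c[i..i+vl)
        i += vl;
    }
}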
Beyond built-ins
• Built-ins may be unavoidable in some situations, but they require taking care of low-level concerns
• Libraries will typically use them in their optimised implementations
• Typical alternatives here involve some amount of vectorization
• SLP vectorization
• Loop vectorization: #pragma omp simd
• Whole-function vectorization: OpenCL, SYCL, #pragma omp declare simd
Loop vectorization
• Loops may (intuitively) be vectorized if
• they do not have loop-carried dependences (i.e., a parallel loop)
• or they do, but the "source" and the "sink" of every loop-carried dependence are many iterations apart in the iteration space
• Note that there are some cases where this definition does not apply yet the loop is still vectorizable, e.g., reductions.
Vector-length specific / agnostic
• Traditionally, vectorization targets a specific vector register size known to the compiler
• This is the natural approach for vector ISAs that prescribe the size of the vector register, such as AVX-512
• This has been called "vector-length specific" (VLS) vectorization
• Arm SVE introduced "vector-length agnostic" (VLA) vectorization
• Useful for architectures where the implementation determines the size of the vector register, as it avoids having many versions of the code for the different vector sizes
• RISC-V (and SVE) can use either approach
• The compiler must be told the minimum vector register size it can assume
• For RISC-V, at BSC we have focused on VLA
• A JIT can be used as a hybrid scheme that does "adaptive" VLS at run time
Current Vectorization Schemes in LLVM
• In a vector loop we must consider what to do when the number of remaining iterations is lower than the number of elements that fit in the vector register.
• Classical scheme implemented in LLVM: a first loop that operates on full vectors, followed by a scalar loop (the epilog) that processes the remainder elements.
• Cons: two loops; long vectors increase the risk that the program only executes the epilog (without entering the vector loop). The epilog can be vectorized again with shorter vectors.
• "Tail folding": only one vector loop, but first compute a mask that disables the elements that would be past the loop boundary, and use the mask in all the operations that need it.
• Pro: one loop
• Cons: needs to compute a mask, even though the mask is only really needed for the loads/stores
Loop vectorization with RVV
• As part of the EPI project, we extended the LLVM loop vectorizer
• It uses the Vector Predication IR proposed by Simon Moll
• Operations in this IR have an explicit vector length and mask
• https://llvm.org/docs/Proposals/VectorPredication.html
• Vector Predication IR maps well to RISC-V
• But it is also applicable to other vector ISAs such as AVX-512 or SVE
Vector length-based vectorization
• Following the style of "tail folding", we can compute the vector length from the remaining number of iterations
• Just as tail folding passes the mask to all the vector operations, we pass the vector length of the current vector-loop iteration to all the vector operations (sketched below)
• Pros: one loop
• Cons: the vector length must be computed in each iteration
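Conceptually, the generated loop has this shape (a scalar sketch, not compiler output; the min computation is what vsetvli performs, ignoring the ceil(AVL/2) relaxation shown earlier):

#include <stddef.h>

void vl_based_loop_shape(size_t n, size_t vlmax) {
    for (size_t i = 0; i < n;) {
        size_t vl = (n - i) < vlmax ? (n - i) : vlmax; // this iteration's vector length
        // ... every vector operation in this iteration processes exactly vl elements ...
        i += vl;
    }
}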
Example: LLVM IR
void add_ref(int N, double *c, double *a, double *b) {
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}
[Figure: the resulting LLVM IR is not reproduced here]
Example: assembly
void add_ref(int N, double *c, double *a, double *b) {
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}
.LBB0_4: # %vector.body
  slli a7, a4, 3
  add a6, a2, a7
  sub a5, a0, a4
  vsetvli t0, a5, e64, m1, ta, mu
  vle64.v v8, (a6)
  add a5, a3, a7
  vle64.v v9, (a5)
  vfadd.vv v8, v8, v9
  add a5, a1, a7
  add a4, a4, t0
  vse64.v v8, (a5)
  bne a4, a0, .LBB0_4
Strided Accesses
• Sometimes loops have to do "strided accesses"
• For instance, loops that operate on arrays of complex numbers
• However, LLVM does not have the notion of a strided memory access
• Complex memory accesses are handled as scatter/gather
• This works for VLS vectorization (by analysing the vector indices), but it is harder to do under VLA
• We have contributed experimental Vector Predication support to LLVM
• The loop vectorizer can identify such accesses and use strided memory accesses
Example
void zaxpy(_Complex float a,
           _Complex float * __restrict dx,
           _Complex float * __restrict dy,
           int n) {
  for (int i = 0; i < n; i++) {
    dy[i] += a * dx[i];
  }
}

_Complex types are pairs of the real and imaginary parts. This code has been compiled using -Ofast so the complex multiplication does not check for NaN or infinite values (that check involves a runtime call that prevents vectorization).

Excerpt of the vector body using gather (vluxe) and scatter (vsoxe):
…
vluxei64.v v11, (zero), v11
vluxei64.v v10, (zero), v10
vfmul.vf v12, v11, fa0
vfmacc.vf v12, fa1, v10
vfmul.vf v11, v11, fa1
vfmsac.vf v11, fa0, v10
vsetvli a5, zero, e64, m1, ta, mu
vadd.vx v9, v9, a1
vsetvli zero, a4, e64, m1, ta, mu
vluxei64.v v10, (zero), v9
vsetvli a5, zero, e64, m1, ta, mu
vadd.vi v13, v9, 8
vsetvli zero, a4, e64, m1, ta, mu
vluxei64.v v14, (zero), v13
vfadd.vv v10, v11, v10
vfadd.vv v11, v12, v14
vsoxei64.v v10, (zero), v9
vsoxei64.v v11, (zero), v13
…

The whole loop using strided load (vlse) and strided store (vsse):
.LBB0_5:
sub a5, a2, a3
vsetvli t0, a5, e64, m1, ta, mu
slli a5, a3, 4
add a4, a6, a5
vlse64.v v8, (a4), t1
add a4, a0, a5
vlse64.v v9, (a4), t1
vfmul.vf v10, v8, fa0
vfmacc.vf v10, fa1, v9
add a4, a1, a5
vlse64.v v11, (a4), t1
add a5, a5, a7
vlse64.v v12, (a5), t1
vfmul.vf v8, v8, fa1
vfmsac.vf v8, fa0, v9
vfadd.vv v8, v8, v11
vfadd.vv v9, v10, v12
vsse64.v v8, (a4), t1
add a3, a3, t0
vsse64.v v9, (a5), t1
bne a3, a2, .LBB0_5
Summary of vectorization schemes
[Figure: summary of the vectorization schemes discussed above]
Vectorization may fail
• Compilers must be conservative in the analyses they perform to avoid breaking the program semantics.
• Sometimes the compiler will not vectorize a loop.
• clang supports the following flags that can help identify the reasons (a sample invocation follows this list)
• -Rpass=loop-vectorize
• Reports successfully vectorized loops.
• -Rpass-missed=loop-vectorize
• Reports loops that were not vectorized.
• -Rpass-analysis=loop-vectorize
• Reports extra details about why a loop failed to vectorize.
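A plausible invocation combining the three flags (the -Rpass* flags are standard clang flags; the target and -march string are assumptions for a vector-enabled RISC-V build):

clang -O2 --target=riscv64 -march=rv64gcv \
      -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
      -Rpass-analysis=loop-vectorize -c test.c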
Vectorization report
1  void works(long N, double *c) {
2    long i;
3    for (i = 0; i < N-1; i++) {
4      c[i] = c[i+4]*0.5;
5    }
6  }
7
8  void fails(long N, double *c) {
9    long i;
10   for (i = 4; i < N; i++) {
11     c[i] = c[i-4]*0.5;
12   }
13 }
14

test.c:3:3: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
  for (i = 0; i < N-1; i++) {
  ^
test.c:3:3: remark: vectorized loop (vectorization width: vscale x 1, interleaved count: 1) [-Rpass=loop-vectorize]
test.c:10:3: remark: loop not vectorized: Scalable vectorization does not support vectorizing loops that are not parallel yet [-Rpass-analysis=loop-vectorize]
VLS Loop Vectorization
• Vector-length specific (VLS) vectorization can be used on RISC-V as well
• One must tell the compiler the minimum size of the vector register it can assume
• -mllvm -riscv-v-vector-bits-min=<vlen>
• (At some point -mcpu=<cpu-name> will internally do that as well)
• Con: the code will only work on CPUs with an equal or larger VLEN
• We could always use the scalar epilogue as a fallback if need be
• Pro: vl is only set outside the loop (though vtype might still have to change)
• Pro: the compiler can generate better code because it is fully aware of the size of the vector register.
Example: with VLEN ≥ 256 we can fit 4 doubles in each register

void daxpy(double a,
           double * __restrict dx,
           double * __restrict dy,
           int n) {
  int i;
  for (i = 0; i < n; i++) {
    dy[i] += a * dx[i];
  }
}

Compiled with -mllvm -riscv-v-vector-bits-min=256:

.LBB0_3:
  andi a6, a2, -8
  vsetivli zero, 4, e64, m1, ta, mu
  vfmv.v.f v8, fa0
  mv a4, a6
  mv a5, a1
  mv a3, a0
.LBB0_4:
  addi a7, a5, 32
  addi t0, a3, 32
  vle64.v v9, (a3)
  vle64.v v10, (t0)
  vle64.v v11, (a5)
  vle64.v v12, (a7)
  vfmacc.vv v11, v8, v9
  vfmacc.vv v12, v8, v10
  vse64.v v11, (a5)
  vse64.v v12, (a7)
  addi a3, a3, 64
  addi a4, a4, -8
  addi a5, a5, 64
  bnez a4, .LBB0_4
  beq a6, a2, .LBB0_8

The compiler has chosen to do "interleaving", in which it effectively uses several (in this case 2) vector operations per scalar operation. LMUL=2 could be an alternative here if we care about code size.
SLP Vectorization
• SLP (Superword Level Parallelism) is an alternative vectorization approach that identifies repeated scalar operations and coalesces them using vector instructions
• SLP is most practical when we know the size of the vector register
• A scalable version is possible: check that VLEN is large enough and branch to the original scalar code otherwise, though this has a code-size impact
SLP Vectorization: Example
void saxpy8(float *__restrict A,
            float *__restrict B,
            float C) {
  A[0] += C*B[0];
  A[1] += C*B[1];
  A[2] += C*B[2];
  A[3] += C*B[3];
  A[4] += C*B[4];
  A[5] += C*B[5];
  A[6] += C*B[6];
  A[7] += C*B[7];
}

Compiled with -mllvm -riscv-v-vector-bits-min=128:

saxpy8:
  vsetivli zero, 4, e32, m1, ta, mu
  vle32.v v8, (a1)
  vle32.v v9, (a0)
  vfmacc.vf v9, fa0, v8
  vse32.v v9, (a0)
  addi a1, a1, 16
  addi a0, a0, 16
  vle32.v v8, (a1)
  vle32.v v9, (a0)
  vfmacc.vf v9, fa0, v8
  vse32.v v9, (a0)
  ret
OpenMP SIMD
• As of version 4.0, OpenMP has constructs that assist with vectorization
• #pragma omp simd
• Loop vectorization
• #pragma omp declare simd
• For functions called in "#pragma omp simd" loops
• #pragma omp simd reuses the existing loop-vectorization infrastructure
• May relax a few legality checks
• We have been implementing support for functions under #pragma omp declare simd where the function receives a vector length
• This way they can be used in loops vectorized using the vector length
Example (BSC suggestion)
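A plausible scalar source for the assembly below, reconstructed by hand and therefore hypothetical: it assumes the constant .LCPI1_0 holds 1.0, so that the body matches the vfrdiv/vfadd/vfrdiv sequence.

#pragma omp declare simd
double example(double x, double y) {
    // 1/x + 1/y, then the reciprocal of the sum
    return 1.0 / (1.0 / x + 1.0 / y);
}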
_ZGVENk2vv_example:
lui a1, %hi(.LCPI1_0)
fld ft0, %lo(.LCPI1_0)(a1)
vsetvli zero, a0, e64, m2, ta, mu
vfrdiv.vf v8, v8, ft0
vfrdiv.vf v10, v10, ft0
vfadd.vv v8, v8, v10
vfrdiv.vf v8, v8, ft0
ret
Try our compiler!
• You can toy with our compiler for RVV in our Compiler Explorer instance
• https://repo.hca.bsc.es/epic
• Make sure you enable at least -O2 to enable vectorization
• It defaults to vector-length-based vectorization
• Click the execution button to run using qemu with VLEN=512 bits
• Click the examples button for examples
• To vectorize functions with #pragma omp declare simd, make sure you pass -fopenmp-simd -Xclang -vectorize-wfv
• The linear and uniform clauses are not implemented yet
• Functions cannot contain loops
• This is currently a limitation of LLVM's Loop Vectorizer
Software Development Vehicles
• How do we create a quick feedback loop for hardware-software co-design?
• Software Development Vehicles with progressive performance fidelity
Vehave emulator
• Trap-based emulator (slow!)
• Allows us to check the correctness of the compiler and of the porting of applications to RVV
• qemu can be used as an alternative for this
• Generates traces that can be used for performance modelling and compiler code-generation analysis using tools like Paraver and MUSA
• Runs on any RISC-V 64-bit Linux platform that does not have RVV support (the vector instructions trap and are emulated)
Wrap-up
Conclusions
• The RISC-V Vector Extension is a powerful and flexible vector ISA
• The vector length is a convenient way to vectorize loops and functions
• It is possible to generate efficient RVV code using LLVM
• Comprehensive set of C/C++ built-ins
• The LLVM loop vectorizer can be used to vectorize for RVV
• Now working on making OpenMP SIMD usable too
Thank you!
The European Processor Initiative (EPI) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647 and under EPI-SGA2: 101036168. Please see http://www.european-processor-initiative.eu for more information.
The European PILOT project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 101034126. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Spain, Italy, Switzerland, Germany, France, Greece, Sweden, Croatia and Turkey.
The MEEP project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 946002. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Spain, Croatia and Turkey.