
DATA-LEVEL PARALLELISM

Computer Architecture, CPSC-7331, Fall 2022

Presenters: Niharika Byrapaneni, Sai Sandeep Surapaneni, Pankaj Gopisetty, Santosh Mendu, Jyothi Kasamolu, Aakanksha Kavuluru



VECTOR ARCHITECTURE

▪ Gathers data elements scattered in memory into large, sequential register files
▪ These register files act as compiler-controlled buffers
▪ Used to hide memory latency and leverage memory bandwidth



RV64V EXTENSION

▪ Vector registers: RV64V provides 32 vector registers, each holding a vector of 64-bit-wide elements
▪ Vector functional units: fully pipelined; a control unit detects structural and data hazards
▪ Vector load/store unit: loads or stores a vector to or from memory
▪ Scalar registers: provide data as input to the vector functional units





HOW VECTOR PROCESSORS WORK



VECTOR EXECUTION TIME

THREE FACTORS:
▪ Length of the operand vectors
▪ Structural hazards among the operations
▪ Data dependences

▪ Vector functional units use multiple parallel pipelines (or lanes) to produce two or more results per clock cycle.

▪ The instructions in a convoy (a set of vector instructions that begin execution together) must not contain any structural hazards.



MULTIPLE LANES: BEYOND ONE ELEMENT PER CLOCK CYCLE

▪ A vector instruction set allows software to pass a large amount of parallel work to hardware using only a single short instruction.

▪ Each lane contains one portion of the vector register file and one execution pipeline from each vector functional unit.



VECTOR-LENGTH REGISTERS: HANDLING LOOPS NOT EQUAL TO 32

▪ The length of a vector operation is often unknown at compile time.

▪ The vector-length register (VL) controls the length of any vector operation; loops longer than the maximum vector length are handled by strip mining, as sketched below.
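The following is a minimal C sketch of strip mining for DAXPY, assuming a hypothetical maximum vector length (MVL) of 32 elements; the first pass handles the odd-sized remainder, and every later pass runs a full MVL-length piece that maps directly onto vector instructions:

    #include <stddef.h>

    #define MVL 32  /* assumed maximum vector length, for illustration only */

    /* y[i] = a*x[i] + y[i] for arbitrary n, processed in MVL-sized strips */
    void daxpy_stripmined(size_t n, double a, const double *x, double *y) {
        size_t low = 0;
        size_t vl = n % MVL;                        /* odd-sized first piece */
        for (size_t j = 0; j <= n / MVL; j++) {
            for (size_t i = low; i < low + vl; i++) /* runs for length vl */
                y[i] = a * x[i] + y[i];             /* the vectorizable body */
            low += vl;
            vl = MVL;                               /* later pieces are full length */
        }
    }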



PREDICATE REGISTERS: HANDLING IF STATEMENTS IN VECTOR LOOPS

▪ A predicate (mask) register holds one bit per vector element; vector instructions then operate only on elements whose mask bit is set, turning the body of an IF into masked vector operations.


PROGRAMMING VECTOR ARCHITECTURES

▪ Compilers can tell programmers at compile time whether a section of code will vectorize or not, often with hints about why it did not.

▪ With such feedback, the median level of vectorization improved from about 70% to about 90%.



ANY QUESTIONS?

SINGLE INSTRUCTION MULTIPLE DATA

▪ One instruction performs multiple data operations.

https://davidaramant.github.io/buddhabrot/sisd_vs_simd.gif


SIMD INSTRUCTIONS

Task | Assembly | Intrinsics | Auto-vectorization
Vectorization | Programmer | Programmer | Compiler
Register allocation | Programmer | Compiler | Compiler
Instruction scheduling | Programmer | Compiler | Compiler
SIMD MULTIMEDIA SET EXTENSIONS
▪ It processes small data types: 8-bit pixels for image and video, and 16-bit audio samples.
▪ With a 256-bit adder, the processor can operate on short vectors of thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit operands.
▪ Multimedia SIMD extensions fix the number of data operands in the opcode, which has led to the addition of hundreds of instructions in the MMX, SSE, and AVX extensions of the x86 architecture.

Instruction category | Operands
Unsigned add/subtract | Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Maximum/Minimum | Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Average | Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Shift right/left | Thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit
Floating point | Sixteen 16-bit, eight 32-bit, four 64-bit, or two 128-bit

▪ The floating-point standard added half-precision (16-bit) and quad-precision (128-bit) floating-point operations, which, combined with the vector-length register, avoids the use of many opcodes.
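To make the partitioning concrete, here is a small C sketch (the type name is hypothetical) showing how the same 256 bits can be viewed as different numbers of narrow operands:

    #include <stdint.h>
    #include <stdio.h>

    /* One 256-bit value viewed under the partitionings listed above */
    typedef union {
        uint8_t  b[32];  /* thirty-two 8-bit operands */
        uint16_t h[16];  /* sixteen 16-bit operands   */
        uint32_t w[8];   /* eight 32-bit operands     */
        uint64_t d[4];   /* four 64-bit operands      */
    } vec256;

    int main(void) {
        printf("%zu bytes = 256 bits\n", sizeof(vec256));  /* prints: 32 bytes */
        return 0;
    }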



SIMD ISA EXTENSIONS
ISA | SIMD Extension | Year | Instructions
x86 | MultiMedia eXtensions (MMX) | 1996 | 57
x86 | Streaming SIMD Extensions (SSE) | 1999 | 70
x86 | SSE2 | 2001 | 144
x86 | SSE3 | 2004 | 13
x86 | SSE4 | 2007 | 47
x86 | Advanced Vector Extensions (AVX) | 2010 | 256



MULTIMEDIA EXTENSIONS (MMX)

▪ For the x86 architecture, MMX instructions repurposed the 64-bit floating-point registers.

▪ These are joined by parallel MAX and MIN operations of the kind found in digital signal processors.

▪ MMX reused the floating-point data transfer instructions to access memory.

https://upload.wikimedia.org/wikipedia/commons/thumb/8/83/Pentium_II.jpg/1024px-Pentium_II.jpg


STREAMING SIMD EXTENSIONS (SSE)

▪ SSE added 16 separate registers that were 128 bits wide (the XMM registers).

▪ It also performed parallel single-precision floating-point arithmetic.

▪ Intel soon added double-precision SIMD floating point via SSE2, SSE3, and SSE4.

▪ These extensions also added ad hoc instructions to accelerate specific multimedia functions.

A 128-bit register can be partitioned as four 32-bit, eight 16-bit, or sixteen 8-bit operands.


ADVANCED VECTOR EXTENSIONS (AVX)

▪ AVX doubled the width of the registers to 256 bits (the YMM registers).

▪ It offers instructions that double the number of operations on all narrow data types.

▪ AVX2 added 30 new instructions, including gather (VGATHER) and vector shifts (VPSLL, VPSRL, VPSRA).

▪ AVX-512 in 2017 doubled the width again to 512 bits (the ZMM registers), adding scatter (VPSCATTER) and mask registers (OPMASK).

A 256-bit register can be partitioned as four 64-bit, eight 32-bit, or sixteen 16-bit operands.


AVX instruction | Description
VADDPD | Add four packed double-precision operands
VSUBPD | Subtract four packed double-precision operands
VMULPD | Multiply four packed double-precision operands
VDIVPD | Divide four packed double-precision operands
VFMADDPD | Multiply and add four packed double-precision operands
VFMSUBPD | Multiply and subtract four packed double-precision operands
VCMPxx | Compare four packed double-precision operands for EQ, NEQ, LT, LE, GT, GE, …
VMOVAPD | Move aligned four packed double-precision operands
VBROADCASTSD | Broadcast one double-precision operand to four locations in a 256-bit register
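As a sketch of how such instructions are usually reached from C rather than hand-written assembly, the following uses compiler intrinsics (assuming an x86 compiler with AVX and FMA support; the function name is illustrative, and _mm256_fmadd_pd compiles to an instruction in the VFMADDPD family):

    #include <immintrin.h>

    /* y[0..3] = a*x[0..3] + y[0..3] with one 256-bit fused multiply-add */
    void fma4(const double *x, double *y, double a) {
        __m256d va = _mm256_set1_pd(a);    /* broadcast a (cf. VBROADCASTSD) */
        __m256d vx = _mm256_loadu_pd(x);   /* load four packed doubles       */
        __m256d vy = _mm256_loadu_pd(y);
        vy = _mm256_fmadd_pd(va, vx, vy);  /* multiply-add (cf. VFMADDPD)    */
        _mm256_storeu_pd(y, vy);           /* store four packed doubles      */
    }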



RISC-V SIMD CODE (DAXPY):

          fld      f0,a          # Load scalar a
          splat.4D f0,f0         # Make 4 copies of a
          addi     x28,x5,#256   # Last address to load
    Loop: fld.4D   f1,0(x5)      # Load X[i] ... X[i+3]
          fmul.4D  f1,f1,f0      # a x X[i] ... a x X[i+3]
          fld.4D   f2,0(x6)      # Load Y[i] ... Y[i+3]
          fadd.4D  f2,f2,f1      # a x X[i]+Y[i] ... a x X[i+3]+Y[i+3]
          fsd.4D   f2,0(x6)      # Store Y[i] ... Y[i+3]
          addi     x5,x5,#32     # Increment index to X
          addi     x6,x6,#32     # Increment index to Y
          bne      x28,x5,Loop   # Check if done
Roofline Visual Performance Model

▪ Plots peak floating-point throughput as a function of arithmetic intensity.

▪ Ties together floating-point performance and memory performance for a target machine.

▪ Arithmetic intensity: floating-point operations per byte of memory read.


EXAMPLES

▪ Attainable GFLOP/s = Min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)
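A minimal C sketch of this bound, using made-up machine numbers (16 GB/s of memory bandwidth and 4 GFLOP/s of peak floating-point performance) purely for illustration:

    #include <stdio.h>

    /* Roofline bound: attainable GFLOP/s at a given arithmetic intensity */
    static double roofline(double peak_bw_gbs, double peak_gflops, double ai) {
        double mem_bound = peak_bw_gbs * ai;   /* memory-limited roof */
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void) {
        for (double ai = 0.0625; ai <= 1.0; ai *= 2.0)  /* sweep intensities */
            printf("AI = %.4f flops/byte -> %.2f GFLOP/s\n",
                   ai, roofline(16.0, 4.0, ai));
        return 0;
    }

The output flattens at 4 GFLOP/s once the arithmetic intensity passes the ridge point of 4/16 = 0.25 flops per byte; below that, the kernel is memory-bound.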



https://www.gocomics.com/calvinandhobbes/
QUESTIONS?

Graphics Processing Units (GPUs)



Motivation
▪ Hard-core gamer

▪ Cryptocurrency proponent and miner

▪ The future shall be decentralized!



Single Instruction Multiple Data (SIMD)
Architecture
There are three main variations of SIMD:
▪ Vector architectures
▪ SIMD extensions
▪ Graphics Processing Units (GPUs)



CPU vs GPU

https://img-prod-cms-rt-microsoft-com.akamaized.net/cms/api/am/imageFileData/RE4Lrag?ver=7e00
https://hardzone.es/app/uploads-hardzone.es/2020/12/chrome_Y4hY9EQmlB.png
▪ Now that we have a powerful graphics accelerator, how can it be utilized for a broader range of applications?

▪ Heterogeneous execution model
  ▪ The CPU acts as the host and the GPU is the device

▪ Develop a C-like programming language for the GPU
  ▪ Compute Unified Device Architecture (CUDA), specific to NVIDIA
  ▪ OpenCL as a vendor-independent language

▪ Unify all forms of GPU parallelism as the CUDA thread

▪ Programming model: "Single Instruction, Multiple Thread" (SIMT)



TERMINOLOGY

▪ A Kernel is a sequence of instructions that can be executed in parallel.

▪ A Thread is a basic unit of execution that executes a single kernel.

▪ A Thread Block is a collection of threads that execute the same kernel.

▪ A Grid is a collection of thread blocks that execute the same kernel.

▪ A CUDA program can start many grids, one for each parallel task required.



Threads, blocks and grid
▪ A thread is associated with each data element
  ▪ CUDA threads: thousands of threads are utilized

▪ Threads are organized into blocks
  ▪ Thread blocks: groups of up to 512 threads
  ▪ Multithreaded SIMD processor: the hardware that executes a whole thread block

▪ Blocks are organized into a grid
  ▪ Blocks are executed independently, in any order
  ▪ Different blocks cannot communicate directly but can coordinate using atomic memory operations in global memory

▪ Thread management is handled by GPU hardware, not by applications or the OS
  ▪ A multiprocessor composed of multithreaded SIMD processors
  ▪ A thread block scheduler



Threads, blocks, and grid example

The mapping of a Grid (vectorizable loop), Thread Blocks (SIMD basic blocks), and threads of SIMD instructions to a vector-vector multiply, with each vector being 8192 elements long.

Source: Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, 6th Edition, Morgan Kaufmann, Elsevier.



NVIDIA GPU architecture

Similarities to vector machines:
▪ Works well with data-level parallel problems
▪ Scatter-gather transfers
▪ Mask registers
▪ Large register files

Differences:
▪ No scalar processor
▪ Uses multithreading to hide memory latency
▪ Has many functional units, as opposed to a few deeply pipelined units like a vector processor



Scalar processing

• NVIDIA engineers analyzed hundreds of shader programs that showed increasing use of scalar computations, and realized that it is hard to efficiently utilize all processing units with a vector architecture.
• They also estimated that as much as a 2x performance improvement could be realized from a scalar architecture that uses 128 scalar processors versus one that uses 32 four-component vector processors, based on the architectural efficiency of the scalar design.
• So, starting with the G80, NVIDIA moved to a scalar processor-based design.



Example: multiply two vectors of length 8192
▪ Code that works over all elements is the grid
▪ Thread blocks break this down into manageable sizes
▪ 512 threads per block
▪ SIMD instruction executes 32 elements at a time
▪ Thus, grid size = 16 blocks
▪ Block is analogous to a strip-mined vector loop with a vector length of 32
▪ Block is assigned to a multithreaded SIMD processor by the thread block scheduler
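A minimal CUDA sketch of this example (the kernel name is illustrative, not from the original): 16 blocks of 512 threads cover all 8192 elements, one element per thread:

    #include <cuda_runtime.h>

    // Element-wise multiply of two 8192-element vectors: one thread per element.
    __global__ void vecMul(const double *a, const double *b, double *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n)                                      // guard the tail
            c[i] = a[i] * b[i];
    }

    // Launch with a grid of 16 thread blocks of 512 threads each:
    // vecMul<<<16, 512>>>(dev_a, dev_b, dev_c, 8192);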



How Is GPU Memory Managed?

▪ CUDA Memory Management API
  ▪ Allocation of GPU memory
  ▪ Transfer of data from the host to GPU memory
  ▪ Freeing GPU memory

Host function | CUDA analogue
malloc | cudaMalloc
memcpy | cudaMemcpy
free | cudaFree
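A minimal host-side sketch of this API in use (error checking elided for brevity):

    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    void roundtrip(size_t n) {
        size_t bytes = n * sizeof(double);
        double *h = (double *)malloc(bytes);             // host allocation
        memset(h, 0, bytes);

        double *d = NULL;
        cudaMalloc((void **)&d, bytes);                  // allocate GPU memory
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // host -> GPU copy
        /* ... launch kernels that operate on d ... */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // GPU -> host copy
        cudaFree(d);                                     // free GPU memory
        free(h);
    }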



GPU Memory Structure for CUDA Programming

Figure (from Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, 6th Edition, Morgan Kaufmann, Elsevier): per-thread private memory holds local variables; per-block shared memory is explicitly managed with the shared qualifier; GPU (global) memory is allocated with cudaMalloc.


GPU Memory Structures

▪ Each SIMD Lane has a private section of off-chip DRAM


o “Private memory”, not shared by any other lanes
o Contains stack frame, spilling registers, and private variables
o Recent GPUs cache in L1 and L2 caches
▪ Each multithreaded SIMD processor also has local memory that
is on-chip
o Shared by SIMD lanes/threads within a block only
▪ The off-chip memory shared by SIMD processors is GPU memory
o Host can read and write GPU memory
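A hedged CUDA sketch of the three memory spaces as they appear inside a kernel (names are illustrative; assumes 256-thread blocks):

    // 'tmp'  lives in registers or per-thread private memory.
    // 'tile' lives in on-chip local memory, shared by one thread block.
    // 'data' points to off-chip GPU (global) memory allocated with cudaMalloc.
    __global__ void memorySpaces(float *data) {
        __shared__ float tile[256];                    // per-block on-chip memory
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global index
        int t = threadIdx.x;                           // index within the block
        float tmp = data[i];                           // read from GPU memory
        tile[t] = tmp * 2.0f;                          // write to shared memory
        __syncthreads();                               // publish writes to the block
        data[i] = tile[(blockDim.x - 1) - t];          // write back to GPU memory
    }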



CUDA Core

▪ A CUDA core is a simple scalar processor with a fully pipelined Arithmetic Logic Unit (ALU) and a Floating-Point Unit (FPU).

▪ It executes a single floating-point or integer instruction per clock.

▪ It has a very simple instruction pipeline.

▪ The Fermi architecture consists of 512 CUDA cores.

https://www.techcenturion.com/wp-content/uploads/2020/09/Nvidia-CUDA-Core.jpg


Full-chip block diagram of the Pascal P100 GPU

https://www.techpowerup.com/img/16-04-12/30a.jpg
It has 56 multithreaded SIMD processors, each with an L1 cache and local memory, 32 L2 units, and a memory-bus width of 4096 data wires. (It has 60 blocks, with four spares to improve yield.) The P100 has 4 HBM2 ports supporting up to 16 GB of capacity. It contains 15.4 billion transistors.


Block diagram of the multithreaded SIMD Processor of a Pascal GPU

Each of the 64 SIMD lanes (cores) has a pipelined floating-point unit, a pipelined integer unit, some logic for dispatching instructions and operands to these units, and a queue for holding results. The 64 SIMD lanes interact with 32 double-precision ALUs (DP units) that perform 64-bit floating-point arithmetic, 16 load-store units (LD/STs), and 16 special function units (SFUs) that calculate functions such as square roots, reciprocals, sines, and cosines.

Source: Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, 6th Edition, Morgan Kaufmann, Elsevier.
Pascal architecture innovations

▪ Fast single-precision, double-precision, and half-precision floating-point arithmetic—Pascal GP100

chip has significant floating-point performance in three sizes.

▪ High-bandwidth memory—The high-bandwidth memory (HBM2) is more than twice as fast as

previous GPUs.

▪ High-speed chip-to-chip interconnect—Pascal GP100 introduces the NVLink communications channel

that supports data transfers of up to 20 GB/s in each direction.

▪ Unified virtual memory and paging support—The Pascal GP100 GPU adds page-fault capabilities

within a unified virtual address space.



https://www.slideteam.net/media/catalog/product/cache/1280x720/3/d/3d_man_covered_with_question_stock_photo_Slide01.jpg
QUESTIONS?
THANK YOU

https://imgs.xkcd.com/comics/circumappendiceal_somectomy_2x.png
Appendix G: Vector Processors in More Depth


What Is a Vector Processor?

▪ A vector processor is a central processing unit that can operate on an entire vector with a single instruction.

▪ It is a complete unit of hardware resources that processes a sequential set of similar data elements in memory using a single instruction.

https://electronicsdesk.com/vector-processor.html
Characteristics of several vector-register
architectures

▪ A vector processor is a CPU with parallel processing elements and the capability for vector processing.

▪ Its main characteristic is that it exploits this parallel processing capability: two or more processing elements operate concurrently.

▪ This makes it possible to perform multiple tasks simultaneously, or to split a task into subtasks handled by different processing elements whose results are then combined.



Startup overhead

▪ Start-up overhead is the time a vector instruction spends before delivering its first result, determined chiefly by the pipeline depth of the vector functional unit.


Pipeline Instruction Start-Up and Multiple Lanes

▪ Adding multiple lanes increases peak performance but does not change start-up latency.

▪ It therefore becomes critical to reduce start-up overhead by allowing the start of one vector instruction to overlap with the completion of preceding vector instructions.

https://www.javatpoint.com/instruction-pipeline


Vector Memory Systems
▪ To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by spreading accesses across multiple independent memory banks.

▪ Having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data.

▪ The desired access rate and the bank access time determine how many banks are needed to access memory without stalls.
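As a simplified worked example: with one access demanded per clock and a bank that stays busy for 6 clocks per access, at least 6 independent banks are needed. A tiny C sketch of that calculation (it ignores conflicts and assumes accesses spread evenly):

    #include <math.h>
    #include <stdio.h>

    /* Minimum independent banks to sustain accesses_per_clock when each
       access occupies a bank for bank_busy_clocks. */
    static int min_banks(double bank_busy_clocks, double accesses_per_clock) {
        return (int)ceil(bank_busy_clocks * accesses_per_clock);
    }

    int main(void) {
        printf("banks needed = %d\n", min_banks(6.0, 1.0));  /* prints 6 */
        return 0;
    }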



Vector Memory Systems in More Depth

https://www.cs.uic.edu/~ajayk/c566/VectorProcessors.pdf
VECTOR PERFORMANCE
▪ Vector execution time depends on:
  ▪ Length of the operand vectors
  ▪ Data dependences
  ▪ Structural hazards

▪ Initiation rate: the rate at which a vector unit consumes new operands and produces new results.

▪ Convoy: a set of vector instructions that can begin execution in the same clock.



IMPROVING PERFORMANCE
▪ If we think of a vector register not as one big block but as a group of individual element registers, we can pipeline data between dependent instructions to improve performance.

▪ For example, MULV.D V1,V2,V3 and ADDV.D V4,V1,V5 must be in separate convoys if we treat each register as a whole block.

▪ If we treat it as a group of individual elements, each holding one value, the ADDV.D can start as soon as the first element of V1 becomes available.

▪ This chaining increases effective convoy size at the cost of additional hardware.



ADVANTAGES OF VECTOR PROCESSING

▪ Each result is independent of previous results, allowing high clock rates.

▪ Fewer memory accesses mean faster processing time.

▪ Vector instructions access memory a block at a time, which results in very low effective memory latency.

▪ Lower cost, due to the lower number of instructions compared to scalar counterparts.



DISADVANTAGES OF VECTOR PROCESSING

▪ Needs large blocks of data to operate on to be efficient, especially since recent advances have increased the speed of scalar memory access.

▪ High price of individual chips due to limitations of on-chip memory.

▪ Increased code complexity is needed to vectorize the data.

▪ High design cost and low returns compared to superscalar microprocessors.



APPLICATIONS OF VECTOR PERFORMANCE

▪ Multimedia processing: compression, graphics, audio synthesis, image processing.

▪ Speech and handwriting recognition.

▪ Applications that involve comparing or processing large blocks of data.



ANY QUESTIONS?

Enhancing Vector Performance

Chaining in More Depth

• Recent implementations use flexible chaining, which allows a vector instruction to chain to
essentially any other active vector instruction, assuming that no structural hazard is
generated.

• Flexible chaining requires simultaneous access to the same vector register by different
vector instructions, which can be implemented either by adding more read and write
ports or by organizing the vector-register file storage into interleaved
banks in a similar way to the memory system.



Even though a pair of operations depends on one another, chaining allows the operations to proceed in parallel on separate elements of the vector. The total running time for the above sequence then becomes:

Vector length + Start-up time(ADDV) + Start-up time(MULV)


 For a chained and an unchained version of the above pair of vector instructions with a vector length of 64, the total time for the chained version is 77 clock cycles (7-cycle MULV start-up + 6-cycle ADDV start-up + 64 elements), or 1.2 cycles per result.


 With 128 floating-point operations done in that time, 1.7 FLOPS per clock cycle are obtained.

 The unchained version takes (7 + 64) + (6 + 64) = 141 clock cycles, or 0.9 FLOPS per clock cycle.


This means, for example, that a sequence containing two vector memory instructions must take at least two
convoys, and hence two chimes, on a processor like VMIPS with only one vector load-store unit.

Chaining is so important that every modern vector processor supports flexible chaining.



Sparse Matrices in More Depth

In a sparse matrix, the elements of a vector are usually stored in some compacted form and then accessed indirectly. Assuming a simplified sparse structure, we might see code that looks like this:

        do 100 i = 1,n
    100 A(K(i)) = A(K(i)) + C(M(i))

This code implements a sparse vector sum on the arrays A and C, using index vectors K and M to designate the nonzero elements of A and C.



More sophisticated vectorizing compilers can vectorize the loop automatically, without programmer annotations, by inserting run-time checks for data dependences. These run-time checks are implemented with a vectorized software version of the advanced load address table (ALAT) hardware described in Appendix H for the Itanium processor.


The indexed load-stores and the CVI instruction provide an alternative method to support conditional vector execution. For reference, here is the strip-mined DAXPY loop, which handles a vector of arbitrary length n in pieces of at most MVL elements:

        low = 1
        VL = (n mod MVL)              /*find the odd-size piece*/
        do 1 j = 0,(n/MVL)            /*outer loop*/
          do 10 i = low, low + VL - 1 /*runs for length VL*/
            Y(i) = a * X(i) + Y(i)    /*main operation*/
    10    continue
          low = low + VL              /*start of next vector*/
          VL = MVL                    /*reset the length to max*/
    1   continue


Here is a vector sequence that implements a conditional loop (subtracting B from the nonzero elements of A) using CVI:

    LV      V1,Ra        ;load vector A into V1
    L.D     F0,#0        ;load FP zero into F0
    SNEVS.D V1,F0        ;sets the VM to 1 if V1(i)!=F0
    CVI     V2,#8        ;generates indices in V2
    POP     R1,VM        ;find the number of 1's in VM
    MTC1    VLR,R1       ;load vector-length register
    CVM                  ;clears the mask
    LVI     V3,(Ra+V2)   ;load the nonzero A elements
    LVI     V4,(Rb+V2)   ;load corresponding B elements
    SUBV.D  V3,V3,V4     ;do the subtract
    SVI     (Ra+V2),V3   ;store A back


The running time of the second version, using indexed loads and stores with a running time of one element per clock, is 4n + 4fn + c2, where f is the fraction of elements for which the condition is true (i.e., A(i) ≠ 0). Ignoring the constant c2:

Time1 = 5(n)
Time2 = 4n + 4fn

We want Time1 > Time2, so

5n > 4n + 4fn
1/4 > f

That is, the second method is faster if less than one-quarter of the elements are nonzero.


Effectiveness of Compiler Vectorization

The second factor is the capability of the compiler. While no compiler can vectorize a loop where no parallelism among the loop iterations exists, there is tremendous variation in the ability of compilers to determine whether a loop can be vectorized, and correspondingly in how well different compilers do at vectorizing programs.


Consider the data showing the extent of vectorization for different processors using a test suite of 100 handwritten FORTRAN kernels. The kernels were designed to test vectorization capability and can all be vectorized by hand; we will see several examples of these loops in the exercises.


• Two different compilers for the Cray X-MP show the large dependence on compiler technology.


Any queries?


THANK YOU



VECTOR PROCESSOR ARCHITECTURE



PERFORMANCE MEASURES IN VECTOR PERFORMANCE

 FLOP = "FLoating-point OPeration"
  • FLOPs = FLoating-point OPerations
  • MFLOPS = millions of FLOPs per second
 Latency
  • When are vectors so small that using a scalar unit is faster than waiting for a deeply pipelined vector unit?
 Vector register file (VRF) characteristics
  • How big is the VRF (number of vectors, vector length)?
  • How many ports are available to the VRF?
 Bandwidth available to memory
  • FLOPs per memory access is a measure of data locality
  • Bandwidth available to memory may limit performance
  • How many vector address generator (VAG) units are available for simultaneous accesses?
  • How effective is the memory system at avoiding bank conflicts?


PERFORMANCE PARAMETERS IN VECTOR PERFORMANCE

 Asymptotic performance
 Half-performance length

ASYMPTOTIC PERFORMANCE

The potential maximum performance of a computer (the "light speed" of performance) gauges how effectively our software executes on a given machine: by comparing a program's measured performance against it, we can determine what percentage of the theoretical peak we are hitting. In scientific computing, a numerical program's peak performance is expressed in flop/s (floating-point operations per second), the number of floating-point operations, as opposed to integer operations, carried out in a second.


Estimation of a computer's peak flop/s is made by multiplying the number of processors, the number of cores per processor (i.e., small processors that share resources like memory with other cores inside a processor), and the peak flop/s of each core. The projected number of flops per second for each core is a function of its clock speed (i.e., how many times the system clock ticks per second) and the number of operations that can be executed in a clock cycle. A fused multiply-add (FMA) circuit, which can do one addition and one multiplication operation each clock cycle, is present in the majority of contemporary CPUs. Additionally, vector registers that each hold multiple operands can be used for each multiply or add operation.
Thus, the theoretical peak flop/s of a computer is computed as:

peak flop/s = #processors x #cores_per_processor x clock_speed x #FMA_units x 2 x #operands_per_vector_register

where:
• #processors is the number of processors that constitute a parallel computer
• #cores_per_processor is the number of cores per multi-core processor
• clock_speed is usually measured in GHz
• #FMA_units is the number of FMA units per core (the factor of 2 counts the multiply and the add of each FMA as two floating-point operations)
• the last term is the number of double-precision operands held in each vector register
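A worked sketch of the formula in C, with made-up machine parameters (2 processors, 16 cores each, 2.0 GHz, 2 FMA units per core, 4 doubles per vector register):

    #include <stdio.h>

    /* Theoretical peak flop/s from the formula above; all inputs hypothetical. */
    static double peak_flops(int procs, int cores, double ghz,
                             int fma_units, int vec_operands) {
        return procs * cores * (ghz * 1e9) * fma_units * 2.0 * vec_operands;
    }

    int main(void) {
        /* 2 x 16 x 2.0e9 x 2 x 2 x 4 = 1.024e12 flop/s */
        printf("peak = %.3f Tflop/s\n", peak_flops(2, 16, 2.0, 2, 4) / 1e12);
        return 0;
    }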
HALF-PERFORMANCE LENGTH

The vector length for which performance is half of the peak performance is known as the half-performance length, N1/2. The vector start-up time and pipeline depth affect a vector processor's performance: as start-up times and pipeline depths grow, peak performance becomes exceedingly challenging to achieve, and the vector length must at least reach N1/2 to deliver half of peak.
 N1/2 gives a feel for how well shorter vectors perform
  • N1/2 is measured with respect to R∞ (how big a vector must be to reach half of the R∞ performance)
  • A smaller N1/2 means the machine achieves a good percentage of peak throughput with short vectors
 N1/2 is determined by a combination of:
  • Vector unit start-up overhead
  • Vector unit latency
  • It varies depending on the operation being performed


BENCHMARK PERFORMANCE

 Linpack
 Floating Point SPECmarks
 Livermore Loops
 NAS Parallel Benchmarks
 Perfect Club Benchmarks



LINPACK

The LINPACK benchmarks assess the floating-point computing power of a system. They were developed by Jack Dongarra to gauge how quickly a computer can tackle the challenging engineering problem of solving the dense n-by-n system of linear equations Ax = b.
The goal is to get a rough idea of how quickly a computer will work when tackling actual problems. It is a simplification, since no single computational task can accurately represent the overall performance of a computer system. However, LINPACK benchmark performance can offer a useful correction to the manufacturer's peak performance, which is determined by multiplying the number of operations the computer can complete per cycle by the frequency of the machine in cycles per second; real performance will never match this peak. [2] Computer performance is a complicated topic influenced by numerous interrelated aspects. The performance assessed by the LINPACK benchmark is the number of 64-bit floating-point operations, generally adds and multiplies, that a computer can execute per second, also known as FLOPS. A computer's performance when executing actual applications, however, is likely to lag well behind even its performance on the relevant LINPACK benchmark.


SUSTAINED PERFORMANCE OF VMIPS ON THE LINPACK BENCHMARK

The Linpack benchmark is a Gaussian elimination on a 100 x 100 matrix, so the vector element lengths range from 99 down to 1. A vector of length k is used k times; thus, the average vector length is given by:

(Σ i², i = 1..99) / (Σ i, i = 1..99) ≈ 66.3

Now we can obtain an accurate estimate of the performance of DAXPY using a vector length of 66.
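A quick C check of that average (illustrative only):

    #include <stdio.h>

    int main(void) {
        double elements = 0.0, uses = 0.0;
        for (int k = 1; k <= 99; k++) {   /* a length-k vector is used k times */
            elements += (double)k * k;    /* total elements processed */
            uses     += k;                /* total vector operations  */
        }
        printf("average vector length = %.1f\n", elements / uses);  /* ~66.3 */
        return 0;
    }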



CRAY X1

 Since 2003, Cray Inc. has produced and sold the Cray X1, a supercomputer with vector processors and non-uniform memory access. The Cray T90, Cray SV1, and Cray T3E architectures were combined into a single system to create the X1.
 The X1 shares with the SV1, T3E, and T90 the multistreaming processors, vector caches, and CMOS designs, as well as the highly scalable distributed memory and liquid-cooling designs. The X1 offers a top speed of 12.8 gigaflops per CPU using an 800 MHz clock and 8-wide vector pipelines. Up to 64 CPUs are offered in air-cooled variants.
 Liquid-cooled systems can accommodate a theoretical maximum of 4096 processors in 32 frames, made up of 1024 shared-memory nodes; such a system would have a peak speed of 50 teraflops. The Oak Ridge National Laboratory's 512-processor system was the largest unclassified X1 system, though it has since been upgraded to an X1E system.
 The X1 may be programmed using shared-memory languages such as the Unified Parallel C programming language or Co-array Fortran, or with widely used message-passing tools such as MPI and PVM. The UNICOS/mp operating system used by the X1 has more in common with the SGI IRIX operating system than with the UNICOS used by Cray machines from earlier generations.




Any queries?


Thank you
