Computer Hardware Engineering (IS1200)
Computer Organization and Components (IS1500)
Fall 2020
Lecture 13: SIMD, MIMD, and Parallel Programming
Artur Podobas
Researcher, KTH Royal Institute of Technology
Slides by David Broman, KTH (Extensions by Artur Podobas)
Course Structure
Module 1: C and Assembly Programming (LE1, LE2, LE3, EX1, LAB1, LE4, S1, LAB2)
Module 2: I/O Systems (LE5, LE6, EX2, LAB3)
Module 3: Logic Design (LE7, LE8, EX3, LD-LAB)
Module 4: Processor Design (LE9, LE10, EX4, S2, LAB4)
Module 5: Memory Hierarchy (LE11, EX5, S3)
Module 6: Parallel Processors and Programs (LE12, LE13, EX6, S4)
Project, IS1500 only (PROJ): START, Proj. Expo, LE14
Abstractions in Computer Systems
Networked Systems and Systems of Systems
Computer System:
• Software: Application Software, Operating System
• Hardware/Software Interface: Instruction Set Architecture
• Digital Hardware Design: Microarchitecture, Logic and Building Blocks, Digital Circuits
• Analog Design and Physics: Analog Circuits, Devices and Physics
Agenda
• Part I: SIMD, Multithreading, and GPUs (DLP)
• Part II: MIMD, Multicore, and Clusters (TLP)
• Part III: Parallelization in Practice (DLP + TLP)
Part I
SIMD, Multithreading, and GPUs
DLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
SISD, SIMD, and MIMD (Revisited)
Flynn's taxonomy classifies architectures by instruction stream and data stream:
• SISD (single instruction stream, single data stream), e.g., the Intel Pentium 4.
• SIMD (single instruction stream, multiple data streams), e.g., SSE instructions in x86. This is data-level parallelism; examples are multimedia extensions (e.g., SSE, Streaming SIMD Extensions) and vector processors.
• MISD (multiple instruction streams, single data stream): few examples exist (systolic arrays come closest).
• MIMD (multiple instruction streams, multiple data streams), e.g., the Intel Core i7. This is task-level parallelism; examples are multicore and cluster computers.
Graphics Processing Units (GPUs) are both SIMD and MIMD.
Subword Parallelism and Multimedia Extensions
Subword parallelism is when a wide data word is operated on in parallel: one instruction operates on multiple data items (for example, on four 32-bit data items at once). This is the same as SIMD or data-level parallelism.
• MMX (MultiMedia eXtension): the first SIMD extension by Intel, in the Pentium processors (introduced 1997). Integers only.
• 3DNow!: AMD's extension that included single-precision floating point (1998).
• SSE/SSE2 (Streaming SIMD Extensions): introduced by Intel in the Pentium III (1999). Included single-precision FP.
• AVX (Advanced Vector Extensions): supported by both Intel and AMD (processors available in 2011). Added support for 256-bit (later 512-bit) vectors and double-precision FP.
• NEON: multimedia extension for ARMv7 and ARMv8 (32 registers of 8 bytes, or 16 registers of 16 bytes).
• SVE: a specialized scalable vector extension to ARMv8.
Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX)
In SSE (and the later SSE2), assembly instructions use a two-operand format. Registers (e.g., %xmm4) are 128 bits wide in SSE/SSE2.

    addpd %xmm0, %xmm4

means %xmm4 = %xmm4 + %xmm0 (note the reversed order). "pd" means packed double-precision FP: the instruction operates on as many FP values as fit in the register.

AVX introduced a three-operand format, added the "v" (for vector) prefix to distinguish AVX from SSE, and renamed the registers to %ymm, which are 256 bits wide.

    vaddpd %ymm0, %ymm1, %ymm4
    vmovapd %ymm4, (%r11)

vaddpd means %ymm4 = %ymm0 + %ymm1. vmovapd moves the result to the memory address stored in %r11 (a 64-bit register), storing the four 64-bit FP values in consecutive order in memory.

Question: How many FP additions does vaddpd perform in parallel? Answer: 4.
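For those who prefer to stay in C, the same packed-double operations can be reached through compiler intrinsics. The snippet below is not from the slides; it is a minimal sketch, assuming a CPU with AVX and a compiler flag such as -mavx.

    #include <immintrin.h>   /* AVX intrinsics */
    #include <stdio.h>

    int main(void){
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        double b[4] = {10.0, 20.0, 30.0, 40.0};
        double c[4];

        __m256d va = _mm256_loadu_pd(a);      /* load 4 doubles into a 256-bit register */
        __m256d vb = _mm256_loadu_pd(b);
        __m256d vc = _mm256_add_pd(va, vb);   /* corresponds to vaddpd: 4 additions in parallel */
        _mm256_storeu_pd(c, vc);              /* store the 4 results back to memory */

        for(int i = 0; i < 4; i++)
            printf("%f\n", c[i]);
        return 0;
    }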
Recall the idea of a multi-issue uniprocessor
[Figure: issue slots (Slot 1, 2, 3) over time for a single thread.]
Typically, all functional units cannot be fully utilized in a single-threaded program (white space in the figure is an unused slot/functional unit).
Hardware Multithreading
In a multithreaded processor, several hardware threads share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses, etc.
• Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome throughput losses from short stalls.
• Fine-grained multithreading switches between hardware threads every cycle, which gives better utilization.
[Figure: issue slots over time for Threads A, B, and C under coarse-grained and fine-grained multithreading.]
Simultaneous Multithreading (SMT)
Simultaneous multithreading (SMT) combines multithreading with a multiple-issue, dynamically scheduled pipeline. It can fill the issue slots that multiple issue alone cannot utilize with cycles from other hardware threads, and thus achieves better utilization.
[Figure: issue slots over time, filled with instructions from Threads A, B, and C in the same cycle.]
Example: Hyper-Threading is Intel's name for and implementation of SMT. That is why a processor can have 2 physical cores while the OS shows 4 cores (4 hardware threads).
Graphics Processing Units (GPUs)
A graphics processing unit (GPU) utilizes multithreading, MIMD, SIMD, and ILP. The main form of parallelism that can be exploited is data-level parallelism.
• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model from NVIDIA. The parallelism is expressed as CUDA threads, and the model is therefore also called Single Instruction Multiple Thread (SIMT).
• A GPU consists of a set of multithreaded SIMD processors (called streaming multiprocessors in NVIDIA terminology), for instance 16 of them.
• The main idea is to execute a massive number of threads and to use multithreading to hide latency. However, the latest GPUs also include caches.
Part II
MIMD, Multicore, and Clusters
TLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
Shared Memory Multiprocessor (SMP)
A shared memory multiprocessor (SMP) has a single physical address space across all processors. An SMP is almost always the same as a multicore processor.
• In a uniform memory access (UMA) multiprocessor, the latency of accessing memory does not depend on which processor does the access.
• In a nonuniform memory access (NUMA) multiprocessor, memory can be divided between the processors, resulting in different latencies.
Processors (cores) in an SMP communicate via shared memory. An alternative interconnect is a Network on Chip (NoC).
[Figure: several processor cores, each with a private L1 cache, sharing an L2 cache and main memory.]
Cache Coherence
Because each core has its own local cache, different cores can see different values for the same memory address. This is called the cache coherence problem.
Example, where memory position X initially holds 0:
1. Core 1 reads memory position X. The value is stored in Core 1's cache.
2. Core 2 reads memory position X. The value is stored in Core 2's cache.
3. Core 1 writes 1 to memory. Core 2 still sees the incorrect (stale) value 0 in its cache.
Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as the snooping protocol.
Heavily simplified example:
1. Core 2 reads memory position X. The value is stored in Core 2's cache.
2. Core 1 writes to X. The write invalidates the cache line in the other processors' caches.
3. Core 2 now tries to read the variable, gets a cache miss, and loads the new value from memory.
False Sharing
Assume that Core 1 and Core 2 share a cache line Z (the same set): Core 1 reads and writes X, and Core 2 reads and writes Y, where X and Y lie in the same cache line.
The cache coherence protocol will then keep invalidating the other core's copy of the cache line, even though neither core is interested in the other one's variable! This is called false sharing.
[Figure: both cores cache the same line Z, which holds X=1 and Y=0.]
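As a rough illustration (not from the slides), the following C sketch shows the kind of layout that triggers false sharing and a common fix: padding each counter onto its own cache line. The 64-byte line size is an assumption that matches most current x86 processors.

    #include <pthread.h>
    #include <stdio.h>

    /* x and y end up in the same cache line: writes from two cores will
       ping-pong the line between their caches (false sharing). */
    struct { volatile long x, y; } shared_line;

    /* Fix: pad each counter so it gets a cache line of its own
       (64 bytes assumed). */
    struct padded { volatile long v; char pad[64 - sizeof(long)]; } separate[2];

    static void *bump(void *arg){
        volatile long *p = (volatile long *)arg;
        for(long i = 0; i < 10000000; i++)
            (*p)++;
        return NULL;
    }

    int main(void){
        pthread_t t1, t2;
        /* False sharing: both counters live in the same line.
           Passing &separate[0].v and &separate[1].v instead avoids it. */
        pthread_create(&t1, NULL, bump, (void*)&shared_line.x);
        pthread_create(&t2, NULL, bump, (void*)&shared_line.y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x = %ld, y = %ld\n", shared_line.x, shared_line.y);
        return 0;
    }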
Processes, Threads, and Cores
A modern operating system (OS) can execute several processes concurrently, and each process can have N concurrent threads. Concurrent threads are scheduled by the OS to execute in parallel on different cores.
• A process context includes its own virtual memory space, I/O files, read-only code, heap, shared libraries, process ID (PID), etc.
• A thread context includes a thread ID, stack, stack pointer, program counter, etc. All threads share the process context, including virtual memory.
(Hands-on: Activity Monitor.)
Programming with Threads and Shared Variables
POSIX threads (pthreads) is a common way of programming concurrency and utilizing multicores for parallel computation. The example creates two threads, each counting up a shared variable.

    #include <stdio.h>
    #include <pthread.h>

    volatile int counter = 0;

    void *count(void *data){
      int i;
      int max = *((int*)data);
      for(i = 0; i < max; i++)
        counter++;
      pthread_exit(NULL);
    }

    int main(){
      pthread_t tid1, tid2;
      int max;

      max = 40000;
      pthread_create(&tid1, NULL, count, &max);
      max = 60000;
      pthread_create(&tid2, NULL, count, &max);

      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);
      printf("counter = %d\n", counter);
      pthread_exit(NULL);
    }

Exercise: What is the output? (Hands-on: show the example.)
Answer: Possibly a different value each time, since the unsynchronized updates of counter race with each other.
Semaphores
A semaphore is a global variable that can hold a nonnegative integer value. It can only be changed by the following two operations:
• P(s): if s > 0, decrement s and return. If s = 0, wait until s > 0, then decrement s and return.
• V(s): increment s.
Note that the check and decrement in P(s) and the increment in V(s) must be atomic, meaning that each appears to happen "instantaneously".
Semaphores were invented by Edsger Dijkstra, who was originally from the Netherlands. P and V are supposed to stand for prolaag (probeer te verlagen, "try to reduce") and verhogen ("increase").
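The P and V semantics above can be sketched in C on top of a pthread mutex and condition variable. This is only a conceptual model of a counting semaphore (in practice you would use sem_t from <semaphore.h>); the type csem_t and the functions sem_P and sem_V are made up for the illustration.

    #include <pthread.h>

    typedef struct {
        int value;                 /* the nonnegative semaphore value s */
        pthread_mutex_t lock;      /* makes P and V appear atomic */
        pthread_cond_t nonzero;    /* signaled when s becomes > 0 */
    } csem_t;
    /* initialize with value = initial count, PTHREAD_MUTEX_INITIALIZER,
       PTHREAD_COND_INITIALIZER */

    void sem_P(csem_t *s){
        pthread_mutex_lock(&s->lock);
        while (s->value == 0)                      /* wait until s > 0 */
            pthread_cond_wait(&s->nonzero, &s->lock);
        s->value--;                                /* then decrement and return */
        pthread_mutex_unlock(&s->lock);
    }

    void sem_V(csem_t *s){
        pthread_mutex_lock(&s->lock);
        s->value++;                                /* increment s */
        pthread_cond_signal(&s->nonzero);          /* wake one waiting P(s) */
        pthread_mutex_unlock(&s->lock);
    }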
Mutex
A semaphore can be used for mutual exclusion, meaning that only one thread can access a particular resource at the same time. Such a binary semaphore is called a mutex.
A global binary semaphore is initialized to 1:

    semaphore s = 1

One or more threads that execute code that needs to be protected then run:

    P(s);
    ... code to be protected ...
    V(s);

P(s), also called wait(s), checks whether the semaphore is nonzero; if so it locks the mutex, otherwise it waits. Inside the critical section it is then ensured that no more than one thread executes the code at the same time. V(s), also called post, unlocks the mutex and increments the semaphore.
Programming with Threads and Shared Variables, with Semaphores
The counter is now protected by a semaphore used as a mutex.

    /* also needs #include <stdio.h>, <pthread.h>, <semaphore.h>, <fcntl.h> */
    volatile int counter = 0;
    sem_t *mutex;

    void *count(void *data){
      int i;
      int max = *((int*)data);
      for(i = 0; i < max; i++){
        sem_wait(mutex);   /* P() */
        counter++;
        sem_post(mutex);   /* V() */
      }
      pthread_exit(NULL);
    }

    int main(){
      pthread_t tid1, tid2;
      int max;

      mutex = sem_open("/semaphore", O_CREAT, O_RDWR, 1);
      sem_unlink("/semaphore");
      max = 40000;
      pthread_create(&tid1, NULL, count, &max);
      max = 60000;
      pthread_create(&tid2, NULL, count, &max);

      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);
      printf("counter = %d\n", counter);
      sem_close(mutex);
      pthread_exit(NULL);
    }

Exercise: Is it correct this time? (Hands-on: show the example.)
Problem: we update the value max between the two pthread_create calls, and max is also shared with the threads...
Programming with Threads and Shared Variables, with Semaphores
Correct solution: the simple fix is to use different variables (max1 and max2) for the two threads.

    volatile int counter = 0;
    sem_t *mutex;

    void *count(void *data){
      int i;
      int max = *((int*)data);
      for(i = 0; i < max; i++){
        sem_wait(mutex);   /* P() */
        counter++;
        sem_post(mutex);   /* V() */
      }
      pthread_exit(NULL);
    }

    int main(){
      pthread_t tid1, tid2;
      int max1 = 40000;
      int max2 = 60000;

      mutex = sem_open("/semaphore", O_CREAT, 0777, 1);
      sem_unlink("/semaphore");
      pthread_create(&tid1, NULL, count, &max1);
      pthread_create(&tid2, NULL, count, &max2);

      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);
      printf("counter = %d\n", counter);
      sem_close(mutex);
      pthread_exit(NULL);
    }

(Hands-on: show the example.)
Clusters and Warehouse-Scale Computers
A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor. (Photo by Robert Harker.)
Warehouse-scale computers are very large clusters that can include 100,000 servers acting as one giant computer (e.g., Facebook, Google, Apple).
Clusters do not communicate over shared memory (as in an SMP) but use message passing.
[Figure: computers 1 to N connected by a network.]
MapReduce is a programming model that is popular for batch processing:
1. Map applies a programmer-defined function to all data items.
2. Reduce collects the output and collapses the data using another programmer-defined function.
The map step is highly parallel. The reduce stage may be parallelized to some extent.
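A full MapReduce framework distributes these steps over many machines, but the core idea can be sketched in a few lines of sequential C. The function names map_square and reduce_sum below are invented for the illustration.

    #include <stdio.h>

    /* Map: a programmer-defined function applied to every data item. */
    static long map_square(long x){ return x * x; }

    /* Reduce: a programmer-defined function that collapses two partial
       results into one (here: summation, which is associative). */
    static long reduce_sum(long a, long b){ return a + b; }

    int main(void){
        long data[] = {1, 2, 3, 4, 5, 6, 7, 8};
        int n = sizeof(data) / sizeof(data[0]);

        /* The map step is independent per item, so it parallelizes trivially. */
        long mapped[8];
        for(int i = 0; i < n; i++)
            mapped[i] = map_square(data[i]);

        /* The reduce step has a dependence chain, but an associative
           reducer can still be parallelized as a tree to some extent. */
        long result = 0;
        for(int i = 0; i < n; i++)
            result = reduce_sum(result, mapped[i]);

        printf("sum of squares = %ld\n", result);
        return 0;
    }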
Supercomputers
Similar to a cluster, but with a focus on high performance:
• Used to solve tough, real-life problems: medicine, weather, fluid dynamics, AI, ...
• Fast inter-node communication: InfiniBand, Tofu, Slingshot, etc.
• Large amounts of memory bandwidth (HBM2, etc.)
• Non-volatile in-memory storage (burst buffers; think caches, but for I/O)
• Performance measured in FLOP/s (double precision); see the Top500 list (www.top500.org)
• Programmed using different models, e.g., OpenMP intra-node (shared memory) and the Message Passing Interface (MPI) inter-node
[Photos: the KTH Beskow supercomputer at PDC; flow around the landing gear of a private jet (Niclas Jansson).]
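As a taste of the inter-node model mentioned above, here is a minimal MPI program in C: every rank computes a partial value and MPI_Reduce combines them on rank 0. It is only a sketch, assuming an MPI installation; it would typically be compiled with mpicc and started with, e.g., mpirun -np 4 ./a.out.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv){
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        /* Each rank contributes a partial result (here just its rank). */
        int partial = rank;
        int total = 0;

        /* Combine all partial results on rank 0 using message passing. */
        MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, total);

        MPI_Finalize();
        return 0;
    }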
Part III
Parallelization in Practice
DLP + TLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
General Matrix Multiplication (GEMM)
A simple matrix multiplication. It takes the matrix size n as a parameter and uses one-dimensional arrays (column-major indexing, i + j*n) for performance.

    void dgemm(int n, double* A, double* B, double* C){
      for(int i = 0; i < n; ++i)
        for(int j = 0; j < n; ++j){
          double cij = C[i+j*n];
          for(int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];
          C[i+j*n] = cij;
        }
    }

(Hands-on: show the example.)
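A small driver is useful for trying the routine out. The sketch below is not from the book; it fills two n x n matrices stored as one-dimensional arrays (column-major, matching the indexing above) and calls dgemm, using an identity matrix so the result is easy to check.

    #include <stdio.h>
    #include <stdlib.h>

    void dgemm(int n, double* A, double* B, double* C);   /* as defined above */

    int main(void){
        int n = 4;
        double *A = malloc(n * n * sizeof(double));
        double *B = malloc(n * n * sizeof(double));
        double *C = calloc(n * n, sizeof(double));   /* C starts at zero */

        /* A = identity, B = 0,1,2,..., so C should equal B after the call. */
        for(int j = 0; j < n; j++)
            for(int i = 0; i < n; i++){
                A[i + j*n] = (i == j) ? 1.0 : 0.0;
                B[i + j*n] = (double)(i + j*n);
            }

        dgemm(n, A, B, C);
        printf("C[0]=%.1f C[%d]=%.1f\n", C[0], n*n - 1, C[n*n - 1]);

        free(A); free(B); free(C);
        return 0;
    }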
Parallelizing GEMM
• Unoptimized: the unoptimized C version (previous page), using one core. 1.7 GigaFLOPS (32x32), 0.8 GigaFLOPS (960x960).
• SIMD: use the AVX instructions vaddpd and vmulpd to do 4 double-precision floating-point operations in parallel. 6.4 GigaFLOPS (32x32), 2.5 GigaFLOPS (960x960).
• ILP: AVX + unrolling parts of the loop, so that the multiple-issue, out-of-order processor has more instructions to schedule. 14.6 GigaFLOPS (32x32), 5.1 GigaFLOPS (960x960).
• Cache: AVX + unrolling + blocking (dividing the problem into submatrices), which avoids cache misses. 13.6 GigaFLOPS (32x32), 12.0 GigaFLOPS (960x960).
• Multicore: AVX + unrolling + blocking + multicore (multithreading using OpenMP). 23 GigaFLOPS (960x960, 2 cores), 44 GigaFLOPS (960x960, 4 cores), 174 GigaFLOPS (960x960, 16 cores).
Experiment by P&H on a 2.6 GHz Intel Core i7 with Turbo mode turned off. For details, see P&H, 5th edition, sections 3.8, 4.12, 5.14, and 6.12.
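The exact AVX and unrolled code is in P&H. As a rough flavour of the last two steps only, here is a hedged sketch combining cache blocking with an OpenMP pragma over the outer block loops (compile with something like gcc -O3 -fopenmp). BLOCK = 32 is an assumed tile size, not the value P&H used, and vectorization is left to the compiler.

    #include <omp.h>

    #define BLOCK 32   /* assumed tile size; tune to the cache sizes */

    /* Blocked matrix multiply, column-major as before.
       n is assumed to be a multiple of BLOCK to keep the sketch short. */
    void dgemm_blocked_parallel(int n, double* A, double* B, double* C){
        #pragma omp parallel for collapse(2)    /* distribute C blocks over cores */
        for(int jj = 0; jj < n; jj += BLOCK)
            for(int ii = 0; ii < n; ii += BLOCK)
                for(int kk = 0; kk < n; kk += BLOCK)
                    /* multiply one BLOCK x BLOCK submatrix pair */
                    for(int j = jj; j < jj + BLOCK; j++)
                        for(int i = ii; i < ii + BLOCK; i++){
                            double cij = C[i + j*n];
                            for(int k = kk; k < kk + BLOCK; k++)
                                cij += A[i + k*n] * B[k + j*n];
                            C[i + j*n] = cij;
                        }
    }

Each thread owns a disjoint block of C (the kk loop stays inside the parallel block loops), so no two threads update the same element and no locking is needed.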
Future perspective (Part 1 of 3):
MIMD, SIMD, ILP, and Caches
“For x86 computers, we expect to see two additional cores per chip
every two years and the SIMD width to double every four years.”
Hennessy & Patterson, Computer Architecture – A
Quantitative Approach, 5th edition, 2013 (page 263)
We must understand and utilize both MIMD and
SIMD to gain maximal speedups in the future,
although MIMD (multicore) has gained much more
attention lately.
The previous example showed that the way we program for ILP and caches also matters significantly.
Future perspective (Part 2 of 3):
Heterogeneity and Accelerators
Heterogeneity: different architectures are good for different things. [Figure source: Wikipedia]
❑ General-purpose systems (CPUs)
  ❑ Latency-critical applications
  ❑ Large/deep memory hierarchies
  ❑ Out-of-order execution
  ❑ Newer systems (e.g., ARM A64FX) offer high performance
❑ Accelerators
  ❑ Graphics Processing Units (GPUs)
    ❑ Throughput (not latency) oriented
    ❑ High amount of thread-level parallelism
    ❑ Programmed in special languages (CUDA/OpenCL/HIP/...)
  ❑ Field-Programmable Gate Arrays (FPGAs)
    ❑ Reconfigurable architectures (can take on many forms)
    ❑ Spatial computing (data-flow)
    ❑ Programmed using High-Level Synthesis (HLS) or HDLs
❑ Gaining in importance
  ❑ Intel acquired Altera (June 2015)
  ❑ AMD acquiring Xilinx (this year)
  ❑ Nvidia to acquire ARM (?)
An exciting (heterogeneous) future is ahead of us!
Future perspective (Part 3 of 3):
Specialization

Architectural Specialization
❑ Move away from "one size fits all"
❑ Specialize the architecture (or silicon) toward a particular application domain
❑ Exploit alternative compute paradigms (e.g., data-flow computing)

Example 1: Cerebras CS-1
❑ "Wafer"-scale chip
❑ 46,225 mm2 (56x larger than a GPU)
❑ 400,000 cores (tailored for AI training)

Example 2: Matrix Engines
❑ Matrix multiplication is "claimed" to be a common workload (e.g., AI)
❑ Many modern processors (Intel Sapphire Rapids, IBM Power 10, Nvidia Volta/Ampere) include hardware support for matrix-matrix multiplication (implemented as systolic arrays)

[Figure: a systolic array computing C = A x B. Matrices A and B are streamed through a grid of processing elements, each performing a multiply-accumulate (MAC), and the result matrix C is drained out.]
It’s about time to…
…relax
Reading Guidelines
Module 6: Parallel Processors and Programs
Lecture 12: Parallelism, Concurrency, Speedup, and ILP
• H&H Chapter 1.8, 3.6, 7.8.3-7.8.5
• P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2
or P&H4 Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2
Lecture 13: SIMD, MIMD, and Parallel Programming
• H&H Chapter 7.8.6-7.8.9
• P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7
or P&H4 Chapters 2.11, 3.6, 5.8, 7.3-7.7
See the course webpage for more information.
Summary
Some key take away points:
• SIMD and GPUs can efficiently parallelize
problems that have data-level parallelism
• MIMD, Multicores, and Clusters can be used to
parallelize problems that have task-level parallelism.
• In the future, we should try to combine and use both
SIMD and MIMD!
Thanks for listening!