Computer Hardware Engineering (IS1200)
Computer Organization and Components (IS1500)
Fall 2020
Lecture 13: SIMD, MIMD, and Parallel Programming
Artur Podobas
Researcher, KTH Royal Institute of Technology
Slides by David Broman, KTH (Extensions by Artur Podobas)
Course Structure
Module 1: C and Assembly Programming (LE1, LE2, LE3, EX1, LAB1, LE4, S1, LAB2)
Module 2: I/O Systems (LE5, LE6, EX2, LAB3)
Module 3: Logic Design (LE7, LE8, EX3, LD-LAB)
Module 4: Processor Design (LE9, LE10, EX4, S2, LAB4)
Module 5: Memory Hierarchy (LE11, EX5, S3)
Module 6: Parallel Processors and Programs (LE12, LE13, EX6, S4)
Project, IS1500 only (PROJ): START, Proj. Expo, LE14
Abstractions in Computer Systems
Networked Systems and Systems of Systems
Computer System:
• Software: Application Software, Operating System
• Hardware/Software Interface: Instruction Set Architecture
• Digital Hardware Design: Microarchitecture, Logic and Building Blocks, Digital Circuits
• Analog Design and Physics: Analog Circuits, Devices and Physics
Agenda
• Part I: SIMD, Multithreading, and GPUs (DLP)
• Part II: MIMD, Multicore, and Clusters (TLP)
• Part III: Parallelization in Practice (DLP + TLP)
Part I
SIMD, Multithreading, and GPUs
DLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
SISD, SIMD, and MIMD (Revisited)
Flynn's taxonomy classifies architectures by instruction stream and data stream:
• SISD (single instruction stream, single data stream), e.g., the Intel Pentium 4.
• SIMD (single instruction stream, multiple data streams), e.g., SSE instructions in x86. This is data-level parallelism; examples are multimedia extensions (e.g., SSE, Streaming SIMD Extensions) and vector processors.
• MISD (multiple instruction streams, single data stream): few examples exist (systolic arrays come closest).
• MIMD (multiple instruction streams, multiple data streams), e.g., the Intel Core i7. This is task-level parallelism; examples are multicore and cluster computers.
Graphics Processing Units (GPUs) are both SIMD and MIMD.
Subword Parallelism and Multimedia Extensions
Subword parallelism is when a wide data word is operated on in parallel: one instruction operates on multiple data items (for example, on four 32-bit data items at once). This is the same as SIMD or data-level parallelism.
• MMX (MultiMedia eXtension): the first SIMD extension by Intel, in the Pentium processors (introduced 1997). Integers only.
• 3DNow!: AMD's extension that included single-precision floating point (1998).
• SSE/SSE2 (Streaming SIMD Extensions): introduced by Intel in the Pentium III (1999). Included single-precision FP.
• AVX (Advanced Vector Extensions): supported by both Intel and AMD (processors available in 2011). Added support for 256-bit (later 512-bit) vectors and double-precision FP.
• NEON: multimedia extension for ARMv7 and ARMv8 (32 registers of 8 bytes, or 16 registers of 16 bytes).
• SVE: a specialized scalable vector extension to ARMv8.
Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX)
In SSE (and the later SSE2), assembly instructions use a two-operand format. Registers (e.g., %xmm4) are 128 bits wide in SSE/SSE2.

    addpd %xmm0, %xmm4

means %xmm4 = %xmm4 + %xmm0 (note the reversed order). "pd" means packed double-precision FP: the instruction operates on as many FP values as fit in the register.

AVX introduced a three-operand format, added the "v" (for vector) prefix to distinguish AVX from SSE, and renamed the registers to %ymm, which are 256 bits wide.

    vaddpd %ymm0, %ymm1, %ymm4
    vmovapd %ymm4, (%r11)

vaddpd means %ymm4 = %ymm0 + %ymm1. vmovapd moves the result to the memory address stored in %r11 (a 64-bit register), storing the four 64-bit FP values in consecutive order in memory.

Question: How many FP additions does vaddpd perform in parallel? Answer: 4.
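For those who prefer to stay in C, the same packed-double operations can be reached through compiler intrinsics. The snippet below is not from the slides; it is a minimal sketch, assuming a CPU with AVX and a compiler flag such as -mavx.

    #include <immintrin.h>   /* AVX intrinsics */
    #include <stdio.h>

    int main(void){
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        double b[4] = {10.0, 20.0, 30.0, 40.0};
        double c[4];

        __m256d va = _mm256_loadu_pd(a);      /* load 4 doubles into a 256-bit register */
        __m256d vb = _mm256_loadu_pd(b);
        __m256d vc = _mm256_add_pd(va, vb);   /* corresponds to vaddpd: 4 additions in parallel */
        _mm256_storeu_pd(c, vc);              /* store the 4 results back to memory */

        for(int i = 0; i < 4; i++)
            printf("%f\n", c[i]);
        return 0;
    }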
Recall the idea of a multi-issue uniprocessor
[Figure: issue slots (Slot 1, 2, 3) over time for a single thread.]
Typically, all functional units cannot be fully utilized in a single-threaded program (white space in the figure is an unused slot/functional unit).
Hardware Multithreading
In a multithreaded processor, several hardware threads share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses, etc.
• Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses. It cannot overcome throughput losses from short stalls.
• Fine-grained multithreading switches between hardware threads every cycle, which gives better utilization.
[Figure: issue slots over time for Threads A, B, and C under coarse-grained and fine-grained multithreading.]
Simultaneous Multithreading (SMT)
Simultaneous multithreading (SMT) combines multithreading with a multiple-issue, dynamically scheduled pipeline. It can fill the issue slots that multiple issue alone cannot utilize with cycles from other hardware threads, and thus achieves better utilization.
[Figure: issue slots over time, filled with instructions from Threads A, B, and C in the same cycle.]
Example: Hyper-Threading is Intel's name for and implementation of SMT. That is why a processor can have 2 physical cores while the OS shows 4 cores (4 hardware threads).
Graphics Processing Units (GPUs)
A graphics processing unit (GPU) utilizes multithreading, MIMD, SIMD, and ILP. The main form of parallelism that can be exploited is data-level parallelism.
• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model from NVIDIA. The parallelism is expressed as CUDA threads, and the model is therefore also called Single Instruction Multiple Thread (SIMT).
• A GPU consists of a set of multithreaded SIMD processors (called streaming multiprocessors in NVIDIA terminology), for instance 16 of them.
• The main idea is to execute a massive number of threads and to use multithreading to hide latency. However, the latest GPUs also include caches.
Part II
MIMD, Multicore, and Clusters
TLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
Shared Memory Multiprocessor (SMP)
A shared memory multiprocessor (SMP) has a single physical address space across all processors. An SMP is almost always the same as a multicore processor.
• In a uniform memory access (UMA) multiprocessor, the latency of accessing memory does not depend on which processor does the access.
• In a nonuniform memory access (NUMA) multiprocessor, memory can be divided between the processors, resulting in different latencies.
Processors (cores) in an SMP communicate via shared memory. An alternative interconnect is a Network on Chip (NoC).
[Figure: several processor cores, each with a private L1 cache, sharing an L2 cache and main memory.]
Cache Coherence
Because each core has its own local cache, different cores can see different values for the same memory address. This is called the cache coherence problem.
Example, where memory position X initially holds 0:
1. Core 1 reads memory position X. The value is stored in Core 1's cache.
2. Core 2 reads memory position X. The value is stored in Core 2's cache.
3. Core 1 writes 1 to memory. Core 2 still sees the incorrect (stale) value 0 in its cache.
Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as the snooping protocol.
Heavily simplified example:
1. Core 2 reads memory position X. The value is stored in Core 2's cache.
2. Core 1 writes to X. The write invalidates the cache line in the other processors' caches.
3. Core 2 now tries to read the variable, gets a cache miss, and loads the new value from memory.
False Sharing
Assume that Core 1 and Core 2 share a cache line Z (the same set): Core 1 reads and writes X, and Core 2 reads and writes Y, where X and Y lie in the same cache line.
The cache coherence protocol will then keep invalidating the other core's copy of the cache line, even though neither core is interested in the other one's variable! This is called false sharing.
[Figure: both cores cache the same line Z, which holds X=1 and Y=0.]
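As a rough illustration (not from the slides), the following C sketch shows the kind of layout that triggers false sharing and a common fix: padding each counter onto its own cache line. The 64-byte line size is an assumption that matches most current x86 processors.

    #include <pthread.h>
    #include <stdio.h>

    /* x and y end up in the same cache line: writes from two cores will
       ping-pong the line between their caches (false sharing). */
    struct { volatile long x, y; } shared_line;

    /* Fix: pad each counter so it gets a cache line of its own
       (64 bytes assumed). */
    struct padded { volatile long v; char pad[64 - sizeof(long)]; } separate[2];

    static void *bump(void *arg){
        volatile long *p = (volatile long *)arg;
        for(long i = 0; i < 10000000; i++)
            (*p)++;
        return NULL;
    }

    int main(void){
        pthread_t t1, t2;
        /* False sharing: both counters live in the same line.
           Passing &separate[0].v and &separate[1].v instead avoids it. */
        pthread_create(&t1, NULL, bump, (void*)&shared_line.x);
        pthread_create(&t2, NULL, bump, (void*)&shared_line.y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x = %ld, y = %ld\n", shared_line.x, shared_line.y);
        return 0;
    }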
Processes, Threads, and Cores
A modern operating system (OS) can execute several processes concurrently, and each process can have N concurrent threads. Concurrent threads are scheduled by the OS to execute in parallel on different cores.
• A process context includes its own virtual memory space, I/O files, read-only code, heap, shared libraries, process ID (PID), etc.
• A thread context includes a thread ID, stack, stack pointer, program counter, etc. All threads share the process context, including virtual memory.
(Hands-on: Activity Monitor.)
Programming with Threads and Shared Variables
POSIX threads (pthreads) is a common way of programming concurrency and utilizing multicores for parallel computation. The example creates two threads, each counting up a shared variable.

    #include <stdio.h>
    #include <pthread.h>

    volatile int counter = 0;

    void *count(void *data){
      int i;
      int max = *((int*)data);
      for(i = 0; i < max; i++)
        counter++;
      pthread_exit(NULL);
    }

    int main(){
      pthread_t tid1, tid2;
      int max;

      max = 40000;
      pthread_create(&tid1, NULL, count, &max);
      max = 60000;
      pthread_create(&tid2, NULL, count, &max);

      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);
      printf("counter = %d\n", counter);
      pthread_exit(NULL);
    }

Exercise: What is the output? (Hands-on: show the example.)
Answer: Possibly a different value each time, since the unsynchronized updates of counter race with each other.
Semaphores
A semaphore is a global variable that can hold a nonnegative integer value. It can only be changed by the following two operations:
• P(s): if s > 0, decrement s and return. If s = 0, wait until s > 0, then decrement s and return.
• V(s): increment s.
Note that the check and decrement in P(s) and the increment in V(s) must be atomic, meaning that each appears to happen "instantaneously".
Semaphores were invented by Edsger Dijkstra, who was originally from the Netherlands. P and V are supposed to stand for prolaag (probeer te verlagen, "try to reduce") and verhogen ("increase").
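The P and V semantics above can be sketched in C on top of a pthread mutex and condition variable. This is only a conceptual model of a counting semaphore (in practice you would use sem_t from <semaphore.h>); the type csem_t and the functions sem_P and sem_V are made up for the illustration.

    #include <pthread.h>

    typedef struct {
        int value;                 /* the nonnegative semaphore value s */
        pthread_mutex_t lock;      /* makes P and V appear atomic */
        pthread_cond_t nonzero;    /* signaled when s becomes > 0 */
    } csem_t;
    /* initialize with value = initial count, PTHREAD_MUTEX_INITIALIZER,
       PTHREAD_COND_INITIALIZER */

    void sem_P(csem_t *s){
        pthread_mutex_lock(&s->lock);
        while (s->value == 0)                      /* wait until s > 0 */
            pthread_cond_wait(&s->nonzero, &s->lock);
        s->value--;                                /* then decrement and return */
        pthread_mutex_unlock(&s->lock);
    }

    void sem_V(csem_t *s){
        pthread_mutex_lock(&s->lock);
        s->value++;                                /* increment s */
        pthread_cond_signal(&s->nonzero);          /* wake one waiting P(s) */
        pthread_mutex_unlock(&s->lock);
    }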
Mutex
A semaphore can be used for mutual exclusion, meaning that only one thread can access a particular resource at the same time. Such a binary semaphore is called a mutex.
A global binary semaphore is initialized to 1:

    semaphore s = 1

One or more threads that execute code that needs to be protected then run:

    P(s);
    ... code to be protected ...
    V(s);

P(s), also called wait(s), checks whether the semaphore is nonzero; if so it locks the mutex, otherwise it waits. Inside the critical section it is then ensured that no more than one thread executes the code at the same time. V(s), also called post, unlocks the mutex and increments the semaphore.
Programming with Threads and Shared Variables, with Semaphores
The counter is now protected by a semaphore used as a mutex.

    /* also needs #include <stdio.h>, <pthread.h>, <semaphore.h>, <fcntl.h> */
    volatile int counter = 0;
    sem_t *mutex;

    void *count(void *data){
      int i;
      int max = *((int*)data);
      for(i = 0; i < max; i++){
        sem_wait(mutex);   /* P() */
        counter++;
        sem_post(mutex);   /* V() */
      }
      pthread_exit(NULL);
    }

    int main(){
      pthread_t tid1, tid2;
      int max;

      mutex = sem_open("/semaphore", O_CREAT, O_RDWR, 1);
      sem_unlink("/semaphore");
      max = 40000;
      pthread_create(&tid1, NULL, count, &max);
      max = 60000;
      pthread_create(&tid2, NULL, count, &max);

      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);
      printf("counter = %d\n", counter);
      sem_close(mutex);
      pthread_exit(NULL);
    }

Exercise: Is it correct this time? (Hands-on: show the example.)
Problem: we update the value max between the two pthread_create calls, and max is also shared with the threads...
Programming with Threads and Shared Variables, with Semaphores
Correct solution: the simple fix is to use different variables (max1 and max2) for the two threads.

    volatile int counter = 0;
    sem_t *mutex;

    void *count(void *data){
      int i;
      int max = *((int*)data);
      for(i = 0; i < max; i++){
        sem_wait(mutex);   /* P() */
        counter++;
        sem_post(mutex);   /* V() */
      }
      pthread_exit(NULL);
    }

    int main(){
      pthread_t tid1, tid2;
      int max1 = 40000;
      int max2 = 60000;

      mutex = sem_open("/semaphore", O_CREAT, 0777, 1);
      sem_unlink("/semaphore");
      pthread_create(&tid1, NULL, count, &max1);
      pthread_create(&tid2, NULL, count, &max2);

      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);
      printf("counter = %d\n", counter);
      sem_close(mutex);
      pthread_exit(NULL);
    }

(Hands-on: show the example.)
Clusters and Warehouse-Scale Computers
A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor. (Photo by Robert Harker.)
Warehouse-scale computers are very large clusters that can include 100,000 servers acting as one giant computer (e.g., Facebook, Google, Apple).
Clusters do not communicate over shared memory (as in an SMP) but use message passing.
[Figure: computers 1 to N connected by a network.]
MapReduce is a programming model that is popular for batch processing:
1. Map applies a programmer-defined function to all data items.
2. Reduce collects the output and collapses the data using another programmer-defined function.
The map step is highly parallel. The reduce stage may be parallelized to some extent.
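A full MapReduce framework distributes these steps over many machines, but the core idea can be sketched in a few lines of sequential C. The function names map_square and reduce_sum below are invented for the illustration.

    #include <stdio.h>

    /* Map: a programmer-defined function applied to every data item. */
    static long map_square(long x){ return x * x; }

    /* Reduce: a programmer-defined function that collapses two partial
       results into one (here: summation, which is associative). */
    static long reduce_sum(long a, long b){ return a + b; }

    int main(void){
        long data[] = {1, 2, 3, 4, 5, 6, 7, 8};
        int n = sizeof(data) / sizeof(data[0]);

        /* The map step is independent per item, so it parallelizes trivially. */
        long mapped[8];
        for(int i = 0; i < n; i++)
            mapped[i] = map_square(data[i]);

        /* The reduce step has a dependence chain, but an associative
           reducer can still be parallelized as a tree to some extent. */
        long result = 0;
        for(int i = 0; i < n; i++)
            result = reduce_sum(result, mapped[i]);

        printf("sum of squares = %ld\n", result);
        return 0;
    }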
Supercomputers
Similar to a cluster, but with a focus on high performance:
• Used to solve tough, real-life problems: medicine, weather, fluid dynamics, AI, ...
• Fast inter-node communication: InfiniBand, Tofu, Slingshot, etc.
• Large amounts of memory bandwidth (HBM2, etc.)
• Non-volatile in-memory storage (burst buffers; think caches, but for I/O)
• Performance measured in FLOP/s (double precision); see the Top500 list (www.top500.org)
• Programmed using different models, e.g., OpenMP intra-node (shared memory) and the Message Passing Interface (MPI) inter-node
[Photos: the KTH Beskow supercomputer at PDC; flow around the landing gear of a private jet (Niclas Jansson).]
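As a taste of the inter-node model mentioned above, here is a minimal MPI program in C: every rank computes a partial value and MPI_Reduce combines them on rank 0. It is only a sketch, assuming an MPI installation; it would typically be compiled with mpicc and started with, e.g., mpirun -np 4 ./a.out.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv){
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        /* Each rank contributes a partial result (here just its rank). */
        int partial = rank;
        int total = 0;

        /* Combine all partial results on rank 0 using message passing. */
        MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, total);

        MPI_Finalize();
        return 0;
    }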
Part III
Parallelization in Practice
DLP + TLP
Acknowledgement: The structure and several of the good examples are derived from the book
“Computer Organization and Design” (2014) by David A. Patterson and John L. Hennessy
General Matrix Multiplication (GEMM)
A simple matrix multiplication. It takes the matrix size n as a parameter and uses one-dimensional arrays (column-major indexing, i + j*n) for performance.

    void dgemm(int n, double* A, double* B, double* C){
      for(int i = 0; i < n; ++i)
        for(int j = 0; j < n; ++j){
          double cij = C[i+j*n];
          for(int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];
          C[i+j*n] = cij;
        }
    }

(Hands-on: show the example.)
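A small driver is useful for trying the routine out. The sketch below is not from the book; it fills two n x n matrices stored as one-dimensional arrays (column-major, matching the indexing above) and calls dgemm, using an identity matrix so the result is easy to check.

    #include <stdio.h>
    #include <stdlib.h>

    void dgemm(int n, double* A, double* B, double* C);   /* as defined above */

    int main(void){
        int n = 4;
        double *A = malloc(n * n * sizeof(double));
        double *B = malloc(n * n * sizeof(double));
        double *C = calloc(n * n, sizeof(double));   /* C starts at zero */

        /* A = identity, B = 0,1,2,..., so C should equal B after the call. */
        for(int j = 0; j < n; j++)
            for(int i = 0; i < n; i++){
                A[i + j*n] = (i == j) ? 1.0 : 0.0;
                B[i + j*n] = (double)(i + j*n);
            }

        dgemm(n, A, B, C);
        printf("C[0]=%.1f C[%d]=%.1f\n", C[0], n*n - 1, C[n*n - 1]);

        free(A); free(B); free(C);
        return 0;
    }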
Parallelizing GEMM
• Unoptimized: the unoptimized C version (previous page), using one core. 1.7 GigaFLOPS (32x32), 0.8 GigaFLOPS (960x960).
• SIMD: use the AVX instructions vaddpd and vmulpd to do 4 double-precision floating-point operations in parallel. 6.4 GigaFLOPS (32x32), 2.5 GigaFLOPS (960x960).
• ILP: AVX + unrolling parts of the loop, so that the multiple-issue, out-of-order processor has more instructions to schedule. 14.6 GigaFLOPS (32x32), 5.1 GigaFLOPS (960x960).
• Cache: AVX + unrolling + blocking (dividing the problem into submatrices), which avoids cache misses. 13.6 GigaFLOPS (32x32), 12.0 GigaFLOPS (960x960).
• Multicore: AVX + unrolling + blocking + multicore (multithreading using OpenMP). 23 GigaFLOPS (960x960, 2 cores), 44 GigaFLOPS (960x960, 4 cores), 174 GigaFLOPS (960x960, 16 cores).
Experiment by P&H on a 2.6 GHz Intel Core i7 with Turbo mode turned off. For details, see P&H, 5th edition, sections 3.8, 4.12, 5.14, and 6.12.
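The exact AVX and unrolled code is in P&H. As a rough flavour of the last two steps only, here is a hedged sketch combining cache blocking with an OpenMP pragma over the outer block loops (compile with something like gcc -O3 -fopenmp). BLOCK = 32 is an assumed tile size, not the value P&H used, and vectorization is left to the compiler.

    #include <omp.h>

    #define BLOCK 32   /* assumed tile size; tune to the cache sizes */

    /* Blocked matrix multiply, column-major as before.
       n is assumed to be a multiple of BLOCK to keep the sketch short. */
    void dgemm_blocked_parallel(int n, double* A, double* B, double* C){
        #pragma omp parallel for collapse(2)    /* distribute C blocks over cores */
        for(int jj = 0; jj < n; jj += BLOCK)
            for(int ii = 0; ii < n; ii += BLOCK)
                for(int kk = 0; kk < n; kk += BLOCK)
                    /* multiply one BLOCK x BLOCK submatrix pair */
                    for(int j = jj; j < jj + BLOCK; j++)
                        for(int i = ii; i < ii + BLOCK; i++){
                            double cij = C[i + j*n];
                            for(int k = kk; k < kk + BLOCK; k++)
                                cij += A[i + k*n] * B[k + j*n];
                            C[i + j*n] = cij;
                        }
    }

Each thread owns a disjoint block of C (the kk loop stays inside the parallel block loops), so no two threads update the same element and no locking is needed.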
Future perspective (Part 1 of 3):
MIMD, SIMD, ILP, and Caches
“For x86 computers, we expect to see two additional cores per chip
every two years and the SIMD width to double every four years.”
Hennessy & Patterson, Computer Architecture – A
Quantitative Approach, 5th edition, 2013 (page 263)
We must understand and utilize both MIMD and
SIMD to gain maximal speedups in the future,
although MIMD (multicore) has gained much more
attention lately.
The previous example showed that the way we program for ILP and caches also matters significantly.
Future perspective (Part 2 of 3):
Heterogeneity and Accelerators
Heterogeneity: different architectures are good for different things. [Figure source: Wikipedia]
❑ General-purpose systems (CPUs)
  ❑ Latency-critical applications
  ❑ Large/deep memory hierarchies
  ❑ Out-of-order execution
  ❑ Newer systems (e.g., ARM A64FX) offer high performance
❑ Accelerators
  ❑ Graphics Processing Units (GPUs)
    ❑ Throughput (not latency) oriented
    ❑ High amount of thread-level parallelism
    ❑ Programmed in special languages (CUDA/OpenCL/HIP/...)
  ❑ Field-Programmable Gate Arrays (FPGAs)
    ❑ Reconfigurable architectures (can take on many forms)
    ❑ Spatial computing (data-flow)
    ❑ Programmed using High-Level Synthesis (HLS) or HDLs
❑ Gaining in importance
  ❑ Intel acquired Altera (June 2015)
  ❑ AMD acquiring Xilinx (this year)
  ❑ Nvidia to acquire ARM (?)
An exciting (heterogeneous) future is ahead of us!
Future perspective (Part 3 of 3):
Specialization

Architectural Specialization
❑ Move away from "one size fits all"
❑ Specialize the architecture (or silicon) toward a particular application domain
❑ Exploit alternative compute paradigms (e.g., data-flow computing)

Example 1: Cerebras CS-1
❑ "Wafer"-scale chip
❑ 46,225 mm2 (56x larger than a GPU)
❑ 400,000 cores (tailored for AI training)

Example 2: Matrix Engines
❑ Matrix multiplication is "claimed" to be a common workload (e.g., AI)
❑ Many modern processors (Intel Sapphire Rapids, IBM Power 10, Nvidia Volta/Ampere) include hardware support for matrix-matrix multiplication (implemented as systolic arrays)

[Figure: a systolic array computing C = A x B. Matrices A and B are streamed through a grid of processing elements, each performing a multiply-accumulate (MAC), and the result matrix C is drained out.]
It’s about time to…
…relax
Reading Guidelines
Module 6: Parallel Processors and Programs
Lecture 12: Parallelism, Concurrency, Speedup, and ILP
• H&H Chapter 1.8, 3.6, 7.8.3-7.8.5
• P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2
or P&H4 Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2
Lecture 13: SIMD, MIMD, and Parallel Programming
• H&H Chapter 7.8.6-7.8.9
• P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7
or P&H4 Chapters 2.11, 3.6, 5.8, 7.3-7.7
See the course webpage for more information.
Summary
Some key take away points:
• SIMD and GPUs can efficiently parallelize
problems that have data-level parallelism
• MIMD, Multicores, and Clusters can be used to
parallelize problems that have task-level parallelism.
• In the future, we should try to combine and use both
SIMD and MIMD!
Thanks for listening!