CS516: Parallelization of Programs
Overview of Parallel Architectures
Vishwesh Jatala
Assistant Professor
Department of CSE
Indian Institute of Technology Bhilai
[email protected]
2023-24 W
Recap: Why Parallel Architectures?
• Moore’s Law: The number of transistors on an IC doubles about every two years
Recap: Moore’s Law Effect
Processor Architecture Roadmap
Course Outline
■ Introduction
■ Overview of Parallel Architectures
■ Performance
■ Parallel Programming
• GPUs and CUDA programming
■ Case studies
■ Extracting Parallelism from Sequential Programs Automatically
Flynn’s Taxonomy
• Flynn’s classification of computer architecture
SISD: Single Instruction, Single Data
• The von Neumann architecture
• Implements a universal Turing machine
• Conforms to serial algorithmic analysis
From http://arstechnica.com/paedia/c/cpu/part-1/cpu1-1.html
SIMD: Single Instruction, Multiple Data
• Single control stream
• All processors operating in lock step
• Fine-grained parallelism
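As a minimal CPU-side sketch (not from the slides; the function name and the use of OpenMP are illustrative assumptions), a vectorized loop captures the SIMD idea: one instruction stream applied in lock step to several data elements per vector instruction.

// Sketch: one instruction stream, many data elements.
// `#pragma omp simd` asks the compiler to vectorize the loop so that one
// vector instruction multiplies several adjacent elements in lock step
// (compile with, e.g., -fopenmp-simd).
#include <cstddef>

void elementwise_mul(float* c, const float* a, const float* b, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];   // same operation, different data per vector lane
}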
SIMD: Single Instruction, Multiple Data
• Example: GPUs
From http://arstechnica.com/paedia/c/cpu/part-1/cpu1-1.html
MIMD: Multiple Instruction, Multiple Data
• Most machines in use today are MIMD
• Multi-core, SMP, Clusters, NUMA machines, etc.
Rest of today’s lecture…
• Flynn’s classification of computer architecture
Flynn’s Taxonomy
• Flynn’s classification of computer architecture
MIMD: Shared Memory Multiprocessors
• Tightly coupled multiprocessors
• Shared global memory address space
• Traditional multiprocessing: symmetric multiprocessing (SMP)
• Existing multi-core processors, multithreaded processors
• Programming model similar to uniprocessors (i.e., multitasking uniprocessor) except
• Operations on shared data require synchronization
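A minimal CPU-side sketch (an assumed example, not from the slides) of the shared-memory model: threads of one process read and write the same address space, so updates to shared data need synchronization (here a C++ std::atomic; a mutex would also work).

// Several threads increment a counter that lives in the shared address space.
// Without the atomic (or a mutex) the concurrent increments would race and
// lose updates.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> counter{0};          // shared data, synchronized access
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&counter] {
            for (int i = 0; i < 1'000'000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    std::printf("counter = %ld\n", counter.load());   // expected: 4000000
    return 0;
}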
Interconnection Schemes for SMP
SMP Architectures
UMA: Uniform Memory Access
• All processors have the same uncontended latency to memory
• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect
UMA: Uniform Memory Access
+ Data placement unimportant/less important (easier to optimize code and make use of available memory space)
- Scaling the system increases all latencies
- Contention could restrict bandwidth and increase latency
How to Scale Shared Memory Machines?
• Two general approaches
• Maintain UMA
• Provide a scalable interconnect to memory
• Scaling the system increases memory latency
• Interconnect complete processors with local memory
• NUMA (Non-uniform memory access)
• Local memory faster than remote memory
• Still needs a scalable interconnect for accessing remote memory
NUMA: Non-Uniform Memory Access
• Shared memory is split into local and remote memory
+ Low latency to local memory
- Much higher latency to remote memories
+ Bandwidth to local memory may be higher
- Performance very sensitive to data placement
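A hedged sketch of NUMA-aware data placement, assuming Linux-style first-touch page placement and OpenMP (neither is specified on the slide): initializing data with the same parallel loop schedule that later computes on it keeps most accesses local.

// First touch places a page in the memory of the NUMA node whose thread
// touches it first, so initialize with the same static schedule used later.
#include <cstddef>

double* numa_friendly_alloc(std::size_t n) {
    double* a = new double[n];           // memory reserved, pages not yet touched
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;                      // first touch: page placed near this thread
    return a;
}

void compute(double* a, std::size_t n) {
    #pragma omp parallel for schedule(static)   // same schedule => mostly local accesses
    for (std::size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2.0 + 1.0;
}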
MIMD: Message Passing Architectures
• Loosely coupled multiprocessors
• No shared global memory address space
• Multicomputer network
• Network-based multiprocessors
• Usually programmed via message passing
• Explicit calls (send, receive) for communication
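A minimal message-passing sketch using MPI (assumed here as the message-passing library; it is not named on the slide): rank 0 sends a buffer to rank 1 with explicit send/receive calls, since there is no shared address space.

// Typically compiled with mpicxx and run with mpirun -np 2 (assumed setup).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> buf(8, 0.0);
    const int tag = 0;

    if (rank == 0) {
        for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = double(i);
        MPI_Send(buf.data(), (int)buf.size(), MPI_DOUBLE, /*dest=*/1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), (int)buf.size(), MPI_DOUBLE, /*source=*/0, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %zu values\n", buf.size());
    }

    MPI_Finalize();
    return 0;
}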
MIMD: Message Passing Architectures
Historical Evolution: 1960s & 70s
• Early MPs
• Mainframes
• Small number of processors
• crossbar interconnect
• UMA
Historical Evolution: 1980s
• Bus-Based MPs
• enabler: processor-on-a-board
• economical scaling
• precursor of today’s SMPs
• UMA
Historical Evolution: Late 80s, mid 90s
• Large Scale MPs (Massively Parallel Processors)
• multi-dimensional interconnects
• each node a computer (proc + cache + memory)
• NUMA
• still used for “supercomputing”
Flynn’s Taxonomy
• Flynn’s classification of computer architecture
SIMD: Single Instruction, Multiple Data
• Example: GPUs
From http://arstechnica.com/paedia/c/cpu/part-1/cpu1-1.html
Data Parallel Programming Model
• Programming Model
• Operations are performed on each element of a large (regular) data structure (array, vector, matrix)
• Simple example (A, B and C are vectors)
C = (A * B)
• The operations can be executed in sequential or parallel steps
• Language supports array assignment
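As an illustrative sketch of the array-assignment style (C++ std::valarray is an assumed stand-in for a data-parallel array language), the whole-array expression C = A * B is written as a single assignment, and the library or compiler decides how to execute it:

// Elementwise multiply expressed without an explicit loop
// (the three valarrays are assumed to have the same length).
#include <valarray>

void vec_mul(std::valarray<float>& C,
             const std::valarray<float>& A,
             const std::valarray<float>& B) {
    C = A * B;   // array assignment: one statement for the whole vector
}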
On Sequential Hardware
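A minimal sketch (function name assumed) of C = A * B on sequential hardware: a single control stream processes one element per loop iteration.

// Sequential execution: one instruction stream, one element at a time.
#include <cstddef>

void vec_mul_seq(float* C, const float* A, const float* B, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        C[i] = A[i] * B[i];
}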
On Data Parallel Hardware
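A minimal CUDA sketch (an assumed example; CUDA programming is covered later in the course) of the same C = A * B on data-parallel hardware: every GPU thread applies the same instruction to its own element. Device allocation and copies are omitted; dA, dB, dC are assumed to be device pointers.

// Data-parallel execution: one thread per element, all threads running the
// same kernel code on different data.
#include <cuda_runtime.h>

__global__ void vec_mul(float* C, const float* A, const float* B, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] * B[i];
}

// Minimal host-side launch, assuming dA, dB, dC already hold device copies
// of the vectors (cudaMalloc/cudaMemcpy omitted for brevity).
void launch_vec_mul(float* dC, const float* dA, const float* dB, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vec_mul<<<blocks, threads>>>(dC, dA, dB, n);
    cudaDeviceSynchronize();
}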
Data Parallel Architectures
• Early architectures directly mirrored programming model
• Single control processor (broadcasts each instruction to an array/grid of processing elements)
• Examples: Connection Machine, MPP (Massively Parallel Processor)
Data Parallel Architectures
• Later data parallel architectures
• Higher integration → SIMD units on chip along with caches
• More generic → multiple cooperating multiprocessors (GPUs)
• Specialized hardware support for global synchronization
SIMD: Graphics Processing Units
• Early GPU designs
• Specialized for graphics processing only
• Exhibited SIMD execution
• Limited programmability
• Example: NVIDIA GeForce 256
• In 2007, fully programmable GPUs arrived
• CUDA released
Single-core CPU vs Multi-core vs GPU
Single-core CPU vs Multi-core vs GPU
NVIDIA V100 GPU
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Specifications
CPUs vs GPUs
Chip-to-chip comparison of peak memory bandwidth in GB/s and peak double-precision gigaflops for GPUs and CPUs since 2008.
https://www.nextplatform.com/2019/07/10/a-decade-of-accelerated-computing-augurs-well-for-gpus
GPU Applications
Specifications
Multi-GPU Systems
https://www.azken.com/images/dgx1_images/dgx1-system-architecture-whitepaper1.pdf
Summary
• Parallel architectures are inevitable
• Different parallel architectures have evolved
• Flynn’s taxonomy:
• SISD
• MISD
• MIMD
• SIMD
References
• David Culler, Jaswinder Pal Singh, and Anoop Gupta. 1998. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
• https://safari.ethz.ch/architecture/fall2020/doku.php?id=schedule
• https://www.cse.iitd.ac.in/~soham/COL380/page.html
• https://s3.wp.wsu.edu/uploads/sites/1122/2017/05/6-9-2017-slides-vFinal.pptx
• https://ebhor.com/full-form-of-cpu/
• Miscellaneous resources on the internet
Thank You