Parallel computer architecture: classification
Basic Computer Architecture
• Von Neumann Architecture
– Uses the stored-program concept
– Memory stores both program instructions and data
– CPU fetches instructions and data from memory
– decodes the instructions
– and executes them sequentially
Hardware Parallelism
• Computing: execute instructions that operate on data.
[Diagram: a computer operating on an instruction stream and a data stream]
• Flynn’s taxonomy (Michael Flynn, 1967) classifies computer
architectures by the number of concurrent instruction streams and
data streams they support.
Flynn’s taxonomy
• Single Instruction Single Data (SISD)
– Traditional sequential computing systems
• Single Instruction Multiple Data (SIMD)
• Multiple Instructions Multiple Data (MIMD)
• Multiple Instructions Single Data (MISD)
Computer Architectures
[Diagram: computer architectures classified as SISD, SIMD, MIMD, or MISD]
SISD
• At any one time, a single instruction operates on a single data
item
• Traditional sequential architecture
SIMD
• At any one time, a single instruction operates on many data items
– Data-parallel architecture (i.e., exploits data-level parallelism)
– Vector architectures have similar characteristics, but achieve the parallelism
with pipelining.
– Performs the same operation on multiple data items
– Ex.: adjusting the audio of a digital video…
• Array processors, GPUs
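A minimal C sketch of the SIMD idea (function and array names are illustrative): one operation, an add, is applied element-wise to whole arrays, and a vectorizing compiler can map the loop onto SIMD instructions that process several elements at once.

```c
#include <stddef.h>

/* Same operation applied to many data elements: C[i] = A[i] + B[i].
 * A vectorizing compiler can map this loop onto SIMD instructions that
 * process several elements per instruction. Names are illustrative. */
void vector_add(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}
```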
Array processor (SIMD)
[Diagram: a single control unit (IP, MAR, OP/ADDR/MDR, decoder) broadcasts one
instruction to many ALUs, each operating in lockstep on its own operands
(A1 B1 C1, A2 B2 C2, ..., AN BN CN) from memory]
MIMD
• Multiple instruction streams operating on multiple data
streams
– Classical distributed-memory or SMP architectures
– Have a number of processors that function asynchronously and
independently
– Ex: Intel Xeon Phi
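A minimal sketch of the MIMD idea using POSIX threads (illustrative, not tied to any particular machine): each thread executes its own instruction stream asynchronously on its own data.

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread is an independent, asynchronous instruction stream
 * operating on its own data, as in the MIMD model. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d: running its own instruction stream\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```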
MISD machine
• Not commonly seen.
• Systolic array is one example of an MISD architecture.
Flynn’s taxonomy summary
• SISD: traditional sequential architecture
• SIMD: processor arrays, vector processor
– Parallel computing on a budget – reduced control unit cost
– Many early supercomputers
• MIMD: most general purpose parallel
computer today
– Clusters, MPP, data centers
• MISD: not a general purpose architecture.
Flynn’s classification on today’s
architectures
• Multicore processors
• Superscalar: pipelined execution + multiple issue
• GPU: CUDA architecture
• IBM BlueGene
Modern classification
(Sima, Fountain, Kacsuk)
• Classify based on how parallelism is achieved
– by operating on multiple data: data parallelism
– by performing many functions in parallel: function
parallelism
• Called control parallelism or task parallelism, depending on the level of the
functional parallelism.
Parallel architectures
[Diagram: parallel architectures split into data-parallel architectures and
function-parallel architectures]
Data parallel architectures
• Vector processors, SIMD (array processors), systolic arrays.
Vector processor (pipelining)
[Diagram: a single control unit (IP, MAR, OP/ADDR/MDR, decoder) streams vector
operands A and B from memory through one pipelined ALU to produce C]
Data parallel architecture: Array
processor
[Diagram: same array-processor organization as above: one control unit
broadcasting to multiple ALUs, each with its own operands
(A1 B1 C1, A2 B2 C2, ..., AN BN CN)]
Control parallel architectures
Function-parallel architectures
– Instruction-level parallel architectures (ILPs)
• Pipelined processors
• VLIW processors
• Superscalar processors
– Thread-level parallel architectures
– Process-level parallel architectures (MIMDs)
• Shared-memory MIMD
• Distributed-memory MIMD
Performance of parallel architectures
• Common metrics
– MIPS: million instructions per second
• MIPS = instruction count / (execution time × 10^6)
– MFLOPS: million floating-point operations per second
• MFLOPS = FP operations in program / (execution time × 10^6)
• Which is the better metric?
• The FLOP count is more closely tied to the time a numerical task takes
– The number of FLOPs per program is determined by the problem size (e.g., the matrix size)
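A small sketch of both formulas, using hypothetical counts and run time:

```c
#include <stdio.h>

/* MIPS and MFLOPS from measured counts and run time (hypothetical values). */
int main(void)
{
    double instr_count = 5.0e9;   /* instructions executed        */
    double fp_ops      = 2.0e9;   /* floating-point operations    */
    double exec_time   = 4.0;     /* execution time in seconds    */

    double mips   = instr_count / (exec_time * 1.0e6);
    double mflops = fp_ops      / (exec_time * 1.0e6);

    printf("MIPS   = %.1f\n", mips);    /* 1250.0 */
    printf("MFLOPS = %.1f\n", mflops);  /*  500.0 */
    return 0;
}
```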
Performance of parallel architectures
• FLOPS units (floating-point operations per second)
• Computer performance

Name        Abbr.   FLOPS
kiloFLOPS   kFLOPS  10^3
megaFLOPS   MFLOPS  10^6
gigaFLOPS   GFLOPS  10^9
teraFLOPS   TFLOPS  10^12
petaFLOPS   PFLOPS  10^15
exaFLOPS    EFLOPS  10^18
zettaFLOPS  ZFLOPS  10^21
yottaFLOPS  YFLOPS  10^24
Peak and sustained performance
• Peak performance
– Measured in MFLOPS
– Highest possible MFLOPS when the system does
nothing but numerical computation
– Rough hardware measure
– Gives little indication of how the system will perform in
practice.
Peak and sustained performance
• Sustained performance
– The MFLOPS rate that a program achieves over the entire run.
• Measuring sustained performance
– Using benchmarks
• Peak MFLOPS is usually much larger than sustained MFLOPS
– Efficiency rate = sustained MFLOPS / peak MFLOPS
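For illustration, with made-up numbers, the efficiency rate is just the ratio of the two:

```c
#include <stdio.h>

/* Efficiency rate = sustained MFLOPS / peak MFLOPS (illustrative numbers). */
int main(void)
{
    double peak_mflops      = 1000.0;   /* hardware peak                */
    double sustained_mflops = 250.0;    /* measured over the entire run */
    printf("efficiency = %.0f%%\n", 100.0 * sustained_mflops / peak_mflops);
    return 0;
}
```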
Measuring the performance of parallel
computers
• Benchmarks: programs that are used to
measure the performance.
– LINPACK benchmark: a measure of a system’s
floating point computing power
• Solving a dense N by N system of linear equations Ax=b
• Used to rank supercomputers in the TOP500 list.
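A sketch of how a LINPACK-style figure is obtained: the operation count for the dense solve is taken as roughly 2/3·N^3 + 2·N^2 floating-point operations and divided by the measured run time (the values below are hypothetical).

```c
#include <stdio.h>

/* GFLOPS for a dense Ax=b solve of size N, using the standard LINPACK
 * operation count 2/3*N^3 + 2*N^2. The run time here is a placeholder. */
int main(void)
{
    double N    = 10000.0;
    double time = 35.0;   /* measured solve time in seconds (hypothetical) */
    double flop = (2.0 / 3.0) * N * N * N + 2.0 * N * N;
    printf("%.1f GFLOPS\n", flop / time / 1.0e9);
    return 0;
}
```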
Other common benchmarks
• Micro-benchmark suites
– Numerical computing
• LAPACK
• ScaLAPACK
– Memory bandwidth
• STREAM
• Kernel benchmarks
– NPB (NAS parallel benchmark)
– PARKBENCH
– SPEC
– Splash
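As an illustration of what a memory-bandwidth benchmark such as STREAM measures, its "triad" kernel boils down to a loop like the sketch below (this is not the official benchmark code); sustained bandwidth is the bytes moved divided by the loop's run time.

```c
#include <stddef.h>

/* STREAM-style "triad" kernel: a[i] = b[i] + scalar * c[i].
 * Each iteration moves 3 doubles (2 loads + 1 store), so the measured
 * bandwidth is roughly 24*n bytes divided by the loop's run time. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```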
Memory architectures
• Shared Memory
• Distributed Memory
• Hybrid Distributed-Shared Memory
Shared Memory
• Shared memory parallel computers vary widely, but generally have in
common the ability for all processors to access all memory as global
address space.
• Multiple processors can operate independently but share the same
memory resources.
• Changes in a memory location made by one processor are visible to
all other processors.
• Shared memory machines can be divided into two main classes based
upon memory access times: UMA and NUMA.
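A minimal OpenMP sketch of the shared-memory model (compile with OpenMP support, e.g. -fopenmp): all threads address the same array in one global address space, and the runtime divides the loop iterations among them.

```c
#include <stdio.h>

#define N 1000000

/* Shared memory: every thread reads and writes the same globally
 * addressable array; OpenMP splits the loop iterations among threads. */
int main(void)
{
    static double a[N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```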
Shared Memory: Pro and Con
• Advantages
– Global address space provides a user-friendly programming perspective to
memory
– Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
• Disadvantages:
– Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-
CPU path and, for cache-coherent systems, geometrically increase traffic
associated with cache/memory management.
– Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
– Expense: it becomes increasingly difficult and expensive to design and produce
shared memory machines with ever increasing numbers of processors.
Distributed Memory
• Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication network
to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
• Because each processor has its own local memory, it operates independently. Changes it
makes to its local memory have no effect on the memory of other processors. Hence,
the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated. Synchronization
between tasks is likewise the programmer's responsibility.
• The network "fabric" used for data transfer varies widely, though it can be as simple
as Ethernet.
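A minimal MPI sketch of the distributed-memory model (run with at least two processes): each process owns a private variable, and the only way to move its value to another process is an explicit message.

```c
#include <mpi.h>
#include <stdio.h>

/* Distributed memory: each rank owns a private value; moving it to
 * another rank requires an explicit message over the network. */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* local to rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```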
Distributed Memory: Pro and Con
• Advantages
– Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.
– Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
– Cost effectiveness: can use commodity, off-the-shelf processors and
networking.
• Disadvantages
– The programmer is responsible for many of the details associated with data
communication between processors.
– It may be difficult to map existing data structures, based on global memory, to
this memory organization.
– Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared Memory
Summarizing a few of the key characteristics of shared and
distributed memory machines
Comparison of Shared and Distributed Memory Architectures

Architecture: CC-UMA | CC-NUMA | Distributed

Examples
– CC-UMA: SMPs, Sun Vexx, DEC/Compaq, SGI Challenge, IBM POWER3
– CC-NUMA: Bull NovaScale, SGI Origin, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4 (MCM)
– Distributed: Cray T3E, Maspar, IBM SP2, IBM BlueGene

Communications
– CC-UMA: MPI, Threads, OpenMP, shmem
– CC-NUMA: MPI, Threads, OpenMP, shmem
– Distributed: MPI

Scalability
– CC-UMA: to 10s of processors
– CC-NUMA: to 100s of processors
– Distributed: to 1000s of processors

Drawbacks
– CC-UMA: Memory-CPU bandwidth
– CC-NUMA: Memory-CPU bandwidth, non-uniform access times
– Distributed: System administration; programming is hard to develop and maintain

Software Availability
– CC-UMA: many 1000s of ISVs
– CC-NUMA: many 1000s of ISVs
– Distributed: 100s of ISVs
Hybrid Distributed-Shared Memory
• The largest and fastest computers in the world today employ both shared and
distributed memory architectures.
• The shared memory component is usually a cache coherent SMP machine.
Processors on a given SMP can address that machine's memory as global.
• The distributed memory component is the networking of multiple SMPs.
SMPs know only about their own memory - not the memory on another SMP.
Therefore, network communications are required to move data from one SMP
to another.
• Current trends seem to indicate that this type of memory architecture will
continue to prevail and increase at the high end of computing for the
foreseeable future.
• Advantages and Disadvantages: whatever is common to both shared and
distributed memory architectures.
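A minimal sketch of the hybrid model, assuming an MPI + OpenMP toolchain: MPI ranks (typically one per SMP node) communicate by message passing, while OpenMP threads inside each rank share that node's memory.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid distributed-shared memory: MPI between nodes, OpenMP within a node. */
int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads within this rank share the node's memory; ranks do not. */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```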
Parallel Computers/Vector
Computers
• MIMD>SIMD>MISD
• Types
– Shared Memory Multiprocessors
– Message Passing Multicomputers
• Differences
– Memory Sharing
– IPC
Vector Processors (Array Processors)
• SIMD
– Large Vector Input Processing
• Features
– Multiple Vector Pipelines
– Concurrently used under Firmware/Hardware
control
• Types
– Memory-to-Memory Architecture
– Register-to-Register Architecture
Parallel Computer Architecture
System Attributes
• Performance of a Computer System
– Machine Capability
• Better Hardware Technology
• Innovative Architectural features
• Efficient Resource Management
– Program Behaviour
• Application and Runtime
• Algorithm Design and Data Structures
• Programming Language
• Compiler Technology
System Attributes to Performance
• Program Performance Metric
– Turn Around Time
– CPU Time
– Clock Rate and CPI
• Cycle Time
• Clock Rate
• Instruction Count
• CPI – Cycles Per Instruction
• Instruction Cycle
– IF, ID, OF, OD, EX
• Memory Cycle
– Time required to complete one memory reference
– Typically k times the processor cycle time
– The value of k depends on the memory technology, cache memory
speed, and the CPU-memory interconnection mechanism
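These factors combine in the usual relation CPU time T = Ic × CPI × cycle time = Ic × CPI / clock rate; a small sketch with hypothetical numbers:

```c
#include <stdio.h>

/* CPU time T = Ic * CPI * tau = Ic * CPI / f  (hypothetical numbers). */
int main(void)
{
    double Ic         = 2.0e9;   /* instruction count              */
    double CPI        = 1.5;     /* average cycles per instruction */
    double clock_rate = 3.0e9;   /* 3 GHz, cycle time tau = 1/f    */

    double T    = Ic * CPI / clock_rate;   /* seconds               */
    double mips = Ic / (T * 1.0e6);        /* = f / (CPI * 10^6)    */

    printf("CPU time = %.3f s, MIPS = %.1f\n", T, mips);
    return 0;
}
```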
System Attributes
• The 5 Performance factors are influenced by 4
system attributes
– Instruction Set Architecture
– Compiler Technology
– CPU Implementation and Control
– Cache & Memory Hierarchy
System Attributes
• FLOPS
• Throughput
Implicit vs. Explicit Parallelism
Multiprocessors & Multicomputers
• Shared-Memory Multiprocessors
– UMA Model
– NUMA Model
– COMA Model
– CC-NUMA
• Distributed Memory Multicomputers
– NORMA
UMA Model
• Tightly Coupled Systems
• Suitable for Time Sharing applications
• Symmetric MP vs Asymmetric MP
• MP vs AP
Summary
• Flynn’s classification
– SISD, SIMD, MIMD, MISD
• Modern classification
– Data parallelism
– function parallelism
• Instruction level, thread level, and process level
• Performance
– MIPS, MFLOPS
– Peak performance and sustained performance
References
• K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability,
Programmability", McGraw-Hill, 1993.
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design
Space Approach", Addison-Wesley, 1997.